Enabling Profile Guided Optimizations for PyPy

PyPy relies more than CPython on “jitting” code as often as possible to achieve speed-ups, rather than on its interpreter. However, jitting is not always an option, or at least not for all of the code. Profile-Guided Optimization (PGO, or profopt) is a good improvement for CPython that we think might benefit PyPy as well, without impacting JIT performance.

I thank the PyPy developer community for the patience, kind advice and constant feedback they gave me on the #pypy IRC channel and through email, which helped make this possible. Special thanks go to Carl Friedrich Bolz-Tereick and Armin Rigo.

1. Introduction

Profile-guided optimizations can differ in implementation from compiler to compiler, but all of them basically follow the same three steps:

  • First, the source code is instrumented during compilation. Not all associated libraries need to be instrumented, but for best performance it is advisable to do so.
  • Second, the binary that results from the first step has some associated “profiles”, which are updated every time you run it (this is the behavior for gcc, at least). This step is also known as the training phase. It is crucial here to run the workloads that are most relevant for the code paths you usually expect or want to be taken. Also remember that altering the code past this point causes inconsistencies in your profiles, and therefore poorer performance.
  • Third, everything but the profiles is cleaned, and everything is recompiled based on the training data. This will unvectorize small loops, ease inlining, improve branch prediction, speed up hot spots etc.

So, how does all this benefit Python or PyPy? Well, since CPython is an interpreter written in C, training its binary with common scenarios can benefit it considerably. It is worth mentioning, though, that the training should not cover everything CPython can be used for. Simply put, if everything is a hot spot then nothing is; training on everything would only make performance worse.

How about PyPy?

PyPy is different, because it already has the benefit of a JIT, so it does not rely on the interpreter as much as CPython does. The underlying issue here is that the assembly code generated by the JIT has no way of benefiting from PGO, because it was never instrumented. However, the interpreted code itself is roughly 3 times slower than CPython, mostly due to the instrumentation of the code that PyPy needs in order to start the JIT.

The target, then, was to improve the PyPy interpreter by compiling it with PGO while avoiding any delay to the JIT, its main source of performance.

Ideally, we hoped for a speed-up similar to CPython, of about 10% on average, but we were aware that imperfect training would always introduce delays.

2. PGO for PyPy

Since PyPy is written in RPython, enabling profile-guided optimizations for it is a bit trickier than for CPython. This is mainly because PyPy needs to be translated and have its sources generated before they can be compiled.

However, to enable PGO we only needed the Makefile generated after the translation, so we simply saved the generated sources and worked on them directly, in order to save time.

Therefore we altered the Makefile to create a new target for our profile-optimized binary. For the first phase, we added the -fprofile-generate flag to the usual GCC command at both the compilation and link steps, then trained the resulting binary. Afterwards, we used the profiles to rebuild the project once again, this time with -fprofile-use -fprofile-correction, again for both compilation and linking.
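
To make the sequence of phases more concrete, here is a minimal sketch in Python that only lists the commands such a build would run. The three -fprofile-* options are real GCC flags; everything else (the file names, the -O3 level, the helper names) is made up for illustration and is not taken from the Makefile that RPython generates.

```python
import shlex

# Real GCC options; file names and helper names below are hypothetical.
GENERATE_FLAGS = ["-fprofile-generate"]
USE_FLAGS = ["-fprofile-use", "-fprofile-correction"]


def build_commands(cc, sources, output, extra_flags):
    """Return the compile commands followed by the link command for one build."""
    objects = [src.rsplit(".", 1)[0] + ".o" for src in sources]
    compile_cmds = [
        [cc, "-O3", *extra_flags, "-c", src, "-o", obj]  # -O3 is an assumption
        for src, obj in zip(sources, objects)
    ]
    # The profiling flags are passed again at link time.
    link_cmd = [cc, *extra_flags, *objects, "-o", output]
    return compile_cmds + [link_cmd]


def pgo_plan(cc, sources, output, training_cmd):
    """List the three PGO phases as argv lists, in the order they would run."""
    plan = build_commands(cc, sources, output, GENERATE_FLAGS)  # phase 1: instrument
    plan.append(list(training_cmd))                             # phase 2: train
    plan += build_commands(cc, sources, output, USE_FLAGS)      # phase 3: rebuild
    return plan


if __name__ == "__main__":
    plan = pgo_plan("gcc", ["interp.c", "gc.c"], "pypy-c", ["./pypy-c", "train.py"])
    for argv in plan:
        print(shlex.join(argv))
```

Nothing is executed here on purpose: printing the plan is enough to see that the same flags appear in both the compile and the link commands of each phase.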

As it is quite hard, and also subjective, to determine a good training set for interpreters in general, we decided to start with the training tests that CPython uses and see whether they offer any performance improvement, and to have a baseline against which to compare any subsequent training sets.

While we have more to explain and debate about how profopt would best work for PyPy, such as tips and tricks, different workloads (we also tried the actual benchmark as a training set 🙂 ), or what the proper use cases for it are, we will focus here on the actual performance gains and how we measured them.

An advantage of the PyPy implementation is that it is not limited to the Python interpreter: it can be used for any existing or future implementation based on RPython!

Stay tuned for a more general implementation of PGO for both GCC and Clang.

3. Usage

So here are the steps you should take to enable PGO for PyPy. As a side note, the tests were performed on Ubuntu 16.04 with gcc 6.2.0. It is also important to mention that on Darwin/Mac (I am working to enable this for Clang on Mac) or Windows, you will most likely not be able to use PGO.

Clone the PyPy repo:

hg clone http://bitbucket.org/pypy/pypy pypy

Install dependencies:

apt-get install gcc make libffi-dev pkg-config libz-dev libbz2-dev \
libsqlite3-dev libncurses-dev libexpat1-dev libssl-dev libgdbm-dev \
tk-dev libgc-dev python-cffi \
liblzma-dev  # For lzma on PyPy3.

Go to the clone and run:

cd pypy/goal
# If you want to enable profopt for PyPy without too much hassle:
python ../../rpython/bin/rpython --opt=jit --cc=/opt/gcc-6.2.0/gcc --profopt

Or, if you want to specify the training script for PGO yourself, you can use the --profoptargs option, which takes the absolute path to your script together with the arguments it requires. For example, to run the exact same script as in the previous case, but on more cores:

cd pypy/goal
python ../../rpython/bin/rpython --opt=jit --cc=/opt/gcc-6.2.0/gcc \
--profopt --profoptargs="/home/md/pypy/lib-python/2.7/test/regrtest.py --pgo -j 18 -x test_asyncore test_gdb test_multiprocessing test_subprocess || true"

# By default the training script runs in 1 process;
# with -j 18 it runs in 18 processes.
# I would advise using the number of cores you have.

# Side note: the "|| true" at the end ensures that the training finishes
# successfully, as some of the regrtests fail on PyPy (12 out of 400).
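
If you would rather not hard-code the number of processes, the argument string can be assembled from the machine's core count. The following is only a sketch: the helper names are hypothetical, while the options (--profopt, --profoptargs, --pgo, -j, -x) are the ones used in the commands above.

```python
import os
import shlex


def profopt_args(regrtest_path, jobs=None, excluded=()):
    """Build the string given to --profoptargs for a regrtest-based training run."""
    jobs = jobs or os.cpu_count() or 1  # fall back to 1 if the count is unknown
    parts = [regrtest_path, "--pgo", "-j", str(jobs)]
    if excluded:
        parts += ["-x", *excluded]
    # "|| true" keeps failing regrtests from aborting the build.
    return " ".join(shlex.quote(part) for part in parts) + " || true"


def translate_command(cc, args):
    """Return the rpython invocation as an argv list (run it from pypy/goal)."""
    return [
        "python", "../../rpython/bin/rpython",
        "--opt=jit", "--cc=" + cc,
        "--profopt", "--profoptargs=" + args,
    ]


if __name__ == "__main__":
    args = profopt_args(
        "/home/md/pypy/lib-python/2.7/test/regrtest.py",
        excluded=["test_asyncore", "test_gdb", "test_multiprocessing", "test_subprocess"],
    )
    print(shlex.join(translate_command("/opt/gcc-6.2.0/gcc", args)))
```

The printed line can be pasted into a shell; shlex.join takes care of quoting the --profoptargs value, which contains spaces.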

Now, the translation process with PGO takes about 1 h 15 min, so grab a cup of coffee, enjoy the mandelbrot, etc. When it ends you should have your pypy-c binary and the associated libpypy-c.so in the goal directory, trained and ready for use.

You can also apply the same concept to any binary that results from an RPython translation. In that case, however, there is obviously no default training workload, so you must supply the training arguments yourself. Consider this example:

cd rpython/translator/goal
../../bin/rpython --profopt --profoptargs=1000 targetrpystonedalone.py

There is no actual training script for the resulting binary, as rpystone is not an interpreter but a binary that takes an integer value as a parameter. To get a clear picture of this, any training is conceptually run as follows:

./your_binary arguments_from_profoptargs

Therefore, in the case of PyPy, since it is a Python interpreter:

./pypy-c /path/to/training/script.py arguments_of_the_script
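
The same idea, written as a small Python sketch. It assumes the command is handed to a shell, which is what the "|| true" suffix used earlier implies; the helper names are hypothetical and this is not the code RPython itself uses.

```python
import subprocess


def training_command(binary, profoptargs):
    """Join the freshly built binary and the --profoptargs string into one command line."""
    return f"{binary} {profoptargs}".strip()


def run_training(binary, profoptargs):
    """Run the training through the shell, so that operators such as '||' keep working."""
    return subprocess.run(training_command(binary, profoptargs), shell=True).returncode


if __name__ == "__main__":
    # A plain RPython binary: the arguments are whatever the binary itself expects.
    print(training_command("./your_binary", "1000"))
    # PyPy: the first argument is the training script, followed by its own arguments.
    print(training_command("./pypy-c", "/path/to/training/script.py --pgo -j 4 || true"))
```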

4. Measurements

Measuring performance gains is not easy for several reasons (situations we have actually encountered):

  • There is no standard benchmark, or the one that exists is inherently unfair to your setup.
    • While there is a standard benchmark suite for Python, called pyperformance, we found that its results can differ by amounts deemed significant even from one run to the next. Multiple runs of the same workload with the same binary should not show relevant differences. It is also quite unfair to PyPy, because it has no warmup phase; warmup usually takes only a small amount of time, but makes a world of difference in the results. This problem brings us to the next point.
  • Not all the tests in a benchmark are as reliable as you would like.
    • After pyperformance, we decided to try a similar benchmark suite implemented in the PyPy project (https://bitbucket.org/pypy/benchmarks/src). It resembles pyperformance, except that it does have a warmup phase, so that the JIT can take full effect before the measurement starts. Even so, not all the tests have low run-to-run jitter. In this benchmark, the translation test is a good example of a test that varies wildly, because of its unusually long running time and because it is not repeated. This is a problem because in such cases you may be unsure what your actual speedup is, or whether you have one at all. The best way to make sure that you actually have a speedup, if you encounter such a scenario, is to repeat the tests and compute the standard deviation and the coefficient of variation.
  • Most of your tests show a good speedup, but one of them is a lot slower.
    • Such a scenario came up after we processed the results. Most tests showed an average improvement of ~6% (which is quite good for an interpreter binary), but one of them (nbody simulations) had an almost 50% slowdown. At this point, it helps to have a strategy for such cases. In our case, we talked with the PyPy devs and they were satisfied with the results. However, their proposal was to leave profopt as an option to be enabled rather than making it the default. This made sense for both us and them, because whoever uses the interpreter should be fully aware of the possible shortcomings of enabling profile-guided optimizations: it works better in most cases, by an average of 6%, but if you do nbody simulations, you will be disappointed.
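
As an illustration of the repeat-and-measure advice above, here is a minimal sketch using Python's statistics module. The timings are made up, the speedup formula is an assumed definition (relative change of the mean run time), and the noise check is a deliberately crude rule of thumb rather than a proper significance test.

```python
import statistics


def summarize(times):
    """Return (mean, sample standard deviation, coefficient of variation)."""
    mean = statistics.mean(times)
    stdev = statistics.stdev(times)  # needs at least two measurements
    return mean, stdev, stdev / mean


def speedup_percent(baseline_times, pgo_times):
    """Relative improvement of the mean run time, in percent (assumed definition)."""
    base = statistics.mean(baseline_times)
    return (base - statistics.mean(pgo_times)) / base * 100.0


def exceeds_noise(baseline_times, pgo_times):
    """Crude check: the means differ by more than both standard deviations combined."""
    base_mean, base_sd, _ = summarize(baseline_times)
    pgo_mean, pgo_sd, _ = summarize(pgo_times)
    return abs(base_mean - pgo_mean) > base_sd + pgo_sd


if __name__ == "__main__":
    baseline = [10.2, 9.9, 10.1, 10.0, 9.8]  # made-up timings, in seconds
    with_pgo = [9.4, 9.5, 9.3, 9.6, 9.2]
    mean, stdev, cv = summarize(with_pgo)
    print(f"PGO runs: mean={mean:.2f}s stdev={stdev:.2f}s cv={cv:.1%}")
    print(f"speedup: {speedup_percent(baseline, with_pgo):.1f}%")
    print("larger than the noise:", exceeds_noise(baseline, with_pgo))
```

A test whose coefficient of variation is of the same order as the speedup you are trying to measure needs more repetitions before its result can be trusted.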

An excerpt of our results is in the table below. You can also see it in more detail, here (https://docs.google.com/spreadsheets/d/1aEUkgUcEXGSieBnn82_vVzORk9fRfdW2UKJlE2jFZCk/edit?usp=sharing).

Benchmark               Speedup vs 5.8.0 (%)  Speedup vs 5.7.1 (%)
ai                      -6.59514475           -1.653259733
bm_chameleon            7.335724952           4.253695178
bm_dulwich_log          11.17004467           11.52603874
bm_krakatau             2.223751046           9.830767116
bm_mako                 10.16285085           8.441075993
bm_mdp                  0.9954771658          3.8279006
chaos                   12.96699933           9.662335477
sphinx                  10.96162741           13.53919982
crypto_pyaes            6.87305902            5.705374238
deltablue               11.04547752           9.751786827
django                  -0.8029511953         -1.195662938
eparse                  14.26472951           7.140076383
fannkuch                2.047509396           9.255620128
float                   -5.041630301          -5.355890836
genshi_text             10.49601345           7.625449172
genshi_xml              10.71362264           8.568584358
go                      8.593695776           11.19313783
hexiom2                 -2.727359392          0.2730693704
html5lib                12.48909146           14.11072664
json_bench              -2.706452658          2.043617202
meteor-contest          0.4124628795          2.599997918
nbody_modified          -46.20906471          -44.00816219
nqueens                 5.313147334           4.335091798
pidigits                -1.218052708          1.625006582
pyflate-fast            2.184923491           11.29683696
pypy_interp             8.946516802           6.795118756
pyxl_bench              14.72355069           12.38676536
raytrace-simple         3.560970911           2.278664684
richards                21.44062274           11.0766451
rietveld                16.48161              15.74474724
scimark_fft             0.08357208261         1.170341048
scimark_lu              -0.001925930003       0.384165752
scimark_montecarlo      -0.1569076222         7.118931795
scimark_sor             0.02145168009         3.88094651
scimark_sparsematmult   0.1862575938          0.6585966432
slowspitfire            16.02340539           17.25387427
spambayes               13.75215023           12.15884043
spectral-norm           8.845451251           6.946412753
spitfire                14.04775125           13.42281879
spitfire_cstringio      -2.581369248          -14.53634085
sqlalchemy_declarative  13.80427176           10.85188974
sqlalchemy_imperative   14.22442979           10.76778705
sqlitesynth             -6.859216967          -5.760158131
sympy_expand            14.5654882            10.96270114
sympy_integrate         11.66925055           9.257962543
sympy_str               16.43562223           13.26291402
sympy_sum               13.26914098           10.97091752
telco                   9.838131638           7.788187003
trans2_annotate         6.882628246           10.03103295
trans2_rtype            1.341178858           1.965765439
trans2_backendopt       5.224810661           11.69271996
trans2_database         9.079338142           11.77439275
trans2_source           3.594638505           12.76569678
twisted_iteration       -4.812491194          -0.9329247761
twisted_names           4.584209853           10.88909424
twisted_pb              0.1975778823          5.544581016
twisted_tcp             1.073193371           -0.3507019983
Average Speedup         5.264198755           5.468945967
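
If you want to look for regressions like nbody in your own results, a table in this shape is easy to process. The sketch below uses four rows copied from the excerpt above; the helper names are hypothetical, and the parser accepts either a decimal point or a decimal comma, since spreadsheet exports sometimes use the latter.

```python
# Four rows copied from the excerpt: name, speedup vs 5.8.0 (%), speedup vs 5.7.1 (%).
ROWS = """\
ai -6.59514475 -1.653259733
richards 21.44062274 11.0766451
nbody_modified -46.20906471 -44.00816219
django -0.8029511953 -1.195662938
"""


def to_float(text):
    """Parse a number written with either a decimal point or a decimal comma."""
    return float(text.replace(",", "."))


def parse_rows(text):
    """Map each benchmark name to its (vs 5.8.0, vs 5.7.1) speedups."""
    results = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        name, vs_580, vs_571 = line.split()
        results[name] = (to_float(vs_580), to_float(vs_571))
    return results


def regressions(results):
    """Names of the benchmarks that got slower against both baselines."""
    return sorted(name for name, (a, b) in results.items() if a < 0 and b < 0)


if __name__ == "__main__":
    results = parse_rows(ROWS)
    print("slower against both baselines:", regressions(results))
    print("worst vs 5.8.0:", min(results, key=lambda name: results[name][0]))
```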

By Mihai Dodan. E-mail: mihai [dot] dodan [at] rinftech.com
