Hunting Performance in Python Code – Part 3. CPU Profiling (scripts)

In this post I will talk about some tools that can help us solve another painful problem in Python: profiling CPU usage.

CPU profiling means measuring the performance of our code by analyzing the way the CPU executes it. This translates to finding the hot spots in our code and seeing how we can deal with them.

We will see next how you can trace the CPU usage of your Python scripts. We will focus on the following profilers (click them to go to the corresponding section in this blog):

  • cProfile
  • line_profiler
  • pprofile
  • vprof

Series index

The links below will go live once the posts are released:

  1. Setup
  2. Memory Profiling
  3. CPU Profiling – Python Scripts
  4. CPU Profiling – Python Interpreter

Measuring CPU Usage

For this post I will use mostly the same script as the one used in the memory analysis posts, and you can see it below or here.

Also, remember that on PyPy2, you need to use a pip version that works with it:

pypy -m ensurepip

and install anything else using:

pypy -m pip install

cProfile

One of the most used tools for CPU profiling is cProfile, mainly because it comes built into CPython2 and PyPy2. It is a deterministic profiler, meaning that it collects a set of statistics while running our workload, such as the execution count or execution time of various portions of our code. Furthermore, cProfile has a lower overhead than the other builtin profilers (profile and hotshot).

The usage is fairly simple with CPython2:

python -m cProfile 03.primes-v1.py

If you are using PyPy2:

pypy -m cProfile 03.primes-v1.py

And the output of this is as follows:

Even with this text output, it is easy to see that our script spends a lot of its time calling the list.append method.

We can see the cProfile output in a graphical way if we use gprof2dot. To use it, we must first install graphviz as a requirement, and after that the package itself. On Ubuntu, run:

apt-get install graphviz
pip install gprof2dot

We run our script again:

python -m cProfile -o output.pstats 03.primes-v1.py
gprof2dot -f pstats output.pstats | dot -Tpng -o output.png

And we get the following output.png file:

[image: output.png callgraph]

It is much easier to see everything this way. Let’s take a closer look at the output. You see a callgraph of the functions from your script. Each square shows, line by line:

  • On the first line: the Python filename, the line number and the method name
  • On the second line: the percentage of global time that this square uses
  • On the third line: in brackets, the percentage of global time spent in the method itself
  • On the fourth line: the number of calls made

For example, in the third red square from the top, the primes method takes 98.28% of the time, with 65.44% of the time spent inside the method itself, and it was called 40 times. The rest of the time is spent in the list.append (22.33%) and range (11.51%) methods from Python.

Since this is a simple script, we just have to rewrite it to avoid so many appends, and it becomes like this:
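The rewritten version is embedded in the original post and not reproduced here; as a purely hypothetical sketch, an append-free primes function could build its result with a list comprehension instead of a loop of list.append calls:

```python
# Hypothetical sketch of an append-free primes function (03.primes-v2.py
# itself is not reproduced here). The list comprehension lets the
# interpreter build the result list without repeated list.append calls.
def primes(n):
    # trial division: keep x if no d in [2, sqrt(x)] divides it
    return [x for x in range(2, n + 1)
            if all(x % d != 0 for d in range(2, int(x ** 0.5) + 1))]

print(primes(30))
```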

If we measure the running time of the script before and after the change, with CPython2:

python 03.primes-v1.py
Benchmark duration: 15.768115043640137 seconds

python 03.primes-v2.py
Benchmark duration: 6.56312108039856 seconds

Also with PyPy2:

pypy 03.primes-v1.py
Benchmark duration: 1.4009230136871338 seconds

pypy 03.primes-v2.py
Benchmark duration: 0.4542720317840576 seconds

We get a good 2.4X improvement with CPython2 and 3.1X with PyPy2. Not bad. And the cProfile callgraph:

[image: callgraph for 03.primes-v2.py]
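As an aside, the Benchmark duration lines above can be produced with a tiny timing harness; here is a hedged sketch (the benchmark helper and the sum workload are illustrative, not taken from the original script):

```python
import time

def benchmark(func, *args):
    # time a single run and report it in the same format as above
    start = time.time()
    func(*args)
    duration = time.time() - start
    print("Benchmark duration: %s seconds" % duration)
    return duration

# illustrative stand-in workload
elapsed = benchmark(sum, range(1000000))
```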

You can also use cProfile in a programmatic way, such as:

import cProfile

pr = cProfile.Profile()
pr.enable()                  # start collecting profiling data

function_to_measure()

pr.disable()                 # stop collecting
pr.print_stats(sort='time')  # report, sorted by internal time

This is useful in some scenarios, such as multiprocess performance measurements. More can be seen here.

line_profiler

This profiler provides information at line level on a workload. It is implemented in C using Cython to keep the profiling cost as low as possible.

The source code repo can be found here and the PyPI page here. Note that it has quite an overhead compared to cProfile, taking 12X more time to get a profile.

To use it, first install it via pip: pip install Cython ipython==5.4.1 line_profiler (CPython2). One major downside of this profiler is that it does not have support for PyPy.

Just like when using memory_profiler, you need to add a decorator to the function you want to be profiled. In our case, you need to add @profile before the definition of our primes function in 03.primes-v1.py. Then call it like this:

kernprof -l 03.primes-v1.py
python -m line_profiler 03.primes-v1.py.lprof

You will get an output like this:

We see that the two loops that repeatedly call list.append take most of our script’s time.
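The decorated script might look like the sketch below; the function body is a hypothetical reconstruction of the append-heavy original, and the try/except fallback keeps the file runnable even when kernprof has not injected the @profile builtin:

```python
# Sketch of how 03.primes-v1.py could be decorated for kernprof.
# kernprof injects @profile into builtins at runtime; the fallback below
# makes the decorator a no-op when the script runs without kernprof.
try:
    profile
except NameError:
    def profile(func):
        return func

@profile
def primes(n):
    # hypothetical append-heavy body resembling the original version
    result = []
    for x in range(2, n + 1):
        is_prime = True
        for d in range(2, x):
            if x % d == 0:
                is_prime = False
                break
        if is_prime:
            result.append(x)
    return result

print(primes(10))
```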

pprofile

According to the author, pprofile is a “line-granularity, thread-aware deterministic and statistic pure-python profiler”. 

It is inspired by line_profiler and fixes many of its disadvantages; because it is written entirely in Python, it can also be used successfully with PyPy. Compared to cProfile, an analysis takes 28X more time on CPython and 10X more time on PyPy, but the level of detail is much more granular.

In addition, it has support for profiling threads, which can come in handy in various situations.

To use it, first install it via pip: pip install pprofile (CPython2) / pypy -m pip install pprofile (PyPy), and then call it like this:

pprofile 03.primes-v1.py

The output is different from what we have seen before, and we get something like this:

We can now see everything in more detail. Let’s go through the output. You get the entire source of the script and, in front of each line, the number of times it was hit, how much time was spent running it (in seconds), the time per hit, and the percentage of the global time spent running it. Also, pprofile adds extra lines to the output (like lines 44 and 50, which start with (call)) with the cumulative metrics.

Again, we see that the two loops that repeatedly call list.append take most of our script’s time.
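pprofile can also be driven from Python code; below is a sketch based on the Profile context-manager API from its README (the workload is illustrative, and the ImportError guard keeps the snippet runnable when pprofile is not installed):

```python
def workload():
    # small stand-in workload
    return sum(i * i for i in range(100000))

try:
    import pprofile  # pure Python, so this also works under PyPy
    profiler = pprofile.Profile()
    with profiler:          # deterministic, line-granularity profiling
        result = workload()
    profiler.print_stats()  # annotated per-line report
except ImportError:
    result = workload()     # pprofile not installed: run unprofiled
```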

vprof

vprof is a Python profiler providing rich and interactive visualizations for various Python program characteristics, such as running time and memory usage. It is a graphical tool, based on Node.js, that displays the results in a web page.

With it, you can see one or all of the following, related to Python scripts:

  • CPU flame graph
  • code profiling
  • memory graph
  • code heatmap

To use it, first install it via pip: pip install vprof (CPython2) / pypy -m pip install vprof (PyPy), and then call it as follows.

On CPython2, to display the code heatmap (first call below) and the code profiling (second call below):

vprof -c h 03.primes-v1.py
vprof -c p 03.primes-v1.py

On PyPy, to display the code heatmap (first call below) and the code profiling (second call below):

pypy -m vprof -c h 03.primes-v1.py
pypy -m vprof -c p 03.primes-v1.py

In either case, you will see the following for the code heatmap

[image: code heatmap screenshot]

and the following for the code profiling.

[image: code profiling screenshot]

The results are displayed graphically, and we can hover over or click each line for more info.

Again, we see that the two loops that repeatedly call list.append take most of our script’s time.


By Alecsandru Patrascu, alecsandru.patrascu [at] rinftech [dot] com
