This is the first post in the series “Hunting Performance in Python Code”. In each post I’ll present some of the tools and profilers that exist for Python code and how each of them helps you find bottlenecks, both in the frontend (Python scripts) and in the backend (the Python interpreter).
The links below will go live once the posts are released:
Before diving into benchmarking and profiling, first we need a proper environment. This means that both the machine and the operating system must be configured for this task.
As a general view, my machine has the following specs:
- Processor: Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
- Memory: 32GB
- OS: Ubuntu 16.04 LTS
- Kernel: 4.4.0-75-generic
The goal is to have reproducible results, making sure that our data is not affected by background processes, operating system configuration, or hardware performance-enhancing technologies.
Let’s start with the configuration of the machine that we use for profiling.
Turbo Boost and Hyper Threading

First of all, disable any hardware performance features. This means disabling Intel Turbo Boost and Hyper Threading in the BIOS/UEFI.
As presented in the official page, Turbo Boost is “a technology that automatically allows processor cores to run faster than the rated operating frequency if they’re operating below power, current, and temperature specification limits”. On the other hand, Hyper Threading is “a technology which uses processor resources more efficiently, enabling multiple threads to run on each core”, as stated here.
These are good features that we paid for, and we definitely want them in production. Then why is it bad to have them enabled when profiling/benchmarking? Because they make results unreliable and hard to reproduce, which translates into run-to-run variation. Let’s see this in a small example, called primes.py, intentionally written poorly 🙂
The code is also available on GitHub here. On Python 2 you will need the statistics backport (on Python 3.4+ the statistics module is part of the standard library):

```
pip install statistics
```
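The script itself is not reproduced in this post, but a minimal sketch of what such a deliberately naive benchmark could look like is shown below (hypothetical: the real primes.py linked above may differ in its limits and repetition count).

```python
# Sketch of a deliberately slow prime-counting benchmark (hypothetical;
# the actual primes.py from the post lives on GitHub and may differ).
import time
import statistics


def is_prime(n):
    # Trial division over every integer below n: intentionally inefficient.
    if n < 2:
        return False
    for d in range(2, n):
        if n % d == 0:
            return False
    return True


def run_once(limit=2000):
    # Time one pass of counting the primes below `limit`.
    start = time.time()
    sum(1 for n in range(limit) if is_prime(n))
    return time.time() - start


if __name__ == "__main__":
    durations = [run_once() for _ in range(5)]
    mean = statistics.mean(durations)
    stdev = statistics.stdev(durations)
    print("Benchmark duration: %s seconds" % sum(durations))
    print("Mean duration: %s seconds" % mean)
    print("Standard deviation: %s (%s %%)" % (stdev, stdev / mean * 100))
```

The point of the deliberately poor `is_prime` is to give the CPU enough sustained work that frequency scaling and scheduling effects show up in the timings.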
Let’s run it in a system that has Turbo Boost and Hyper Threading enabled:
```
$ python primes.py
Benchmark duration: 1.0644240379333496 seconds
Mean duration: 0.2128755569458008 seconds
Standard deviation: 0.032928838418120374 (15.468585914964498 %)
```
Now on the same system, but with Turbo Boost and Hyper Threading disabled:
```
$ python primes.py
Benchmark duration: 1.2374498844146729 seconds
Mean duration: 0.12374367713928222 seconds
Standard deviation: 0.000684464852339824 (0.553131172568 %)
```
Observe the standard deviation in the first case: 15%. This is a HUGE value! Suppose you make an optimization that brings a 6% speedup: how will you distinguish it from run-to-run variation?
In the second case, by contrast, the variation is reduced to approximately 0.6%, and your shiny new optimization will be clearly visible!
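This argument can be made concrete with a crude rule of thumb (my own illustration, not from the post): a measured speedup is only trustworthy if it clearly exceeds the run-to-run deviation.

```python
def distinguishable(speedup_pct, stdev_pct):
    # A measured speedup smaller than the run-to-run standard deviation
    # cannot be attributed to the code change with any confidence.
    return speedup_pct > stdev_pct


# A hypothetical 6% optimization against the two configurations above:
print(distinguishable(6.0, 15.5))  # Turbo Boost/HT enabled -> False (noise)
print(distinguishable(6.0, 0.6))   # both disabled -> True (real signal)
```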
CPU power savings
Disable any CPU power-saving features and use a fixed CPU frequency. This can be done by changing the Linux power governor from intel_pstate to acpi-cpufreq. The intel_pstate driver implements a scaling driver with an internal governor for Intel Core (Sandy Bridge and newer) processors, while the acpi-cpufreq driver utilizes the ACPI Processor Performance States.
Let’s check it out first!
```
$ cpupower frequency-info
analyzing CPU 0:
  driver: intel_pstate
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency: 0.97 ms.
  hardware limits: 1.20 GHz - 3.60 GHz
  available cpufreq governors: performance, powersave
  current policy: frequency should be within 1.20 GHz and 3.60 GHz.
                  The governor "powersave" may decide which speed to use
                  within this range.
  current CPU frequency is 1.20 GHz.
  boost state support:
    Supported: yes
    Active: yes
```
You can see that the governor in use is powersave and that the CPU frequency scales between 1.20 GHz and 3.60 GHz. This is fine for your personal computer or any other day-to-day usage, but it hurts the results when doing benchmarks.
What are the possible values for the governor? If we browse the documentation we see that we can use the following:
- performance – run the CPU at the maximum frequency.
- powersave – run the CPU at the minimum frequency.
- userspace – run the CPU at user-specified frequencies.
- ondemand – scale the frequency dynamically according to the current load; jumps to the highest frequency and then backs off as idle time increases.
- conservative – scale the frequency dynamically according to the current load; scales more gradually than ondemand.
What we want is the performance governor, with the frequency set to the maximum supported by the CPU. Something like this:
```
$ cpupower frequency-info
analyzing CPU 0:
  driver: acpi-cpufreq
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency: 10.0 us.
  hardware limits: 1.20 GHz - 2.30 GHz
  available frequency steps: 2.30 GHz, 2.20 GHz, 2.10 GHz, 2.00 GHz,
                             1.90 GHz, 1.80 GHz, 1.70 GHz, 1.60 GHz,
                             1.50 GHz, 1.40 GHz, 1.30 GHz, 1.20 GHz
  available cpufreq governors: conservative, ondemand, userspace,
                               powersave, performance
  current policy: frequency should be within 2.30 GHz and 2.30 GHz.
                  The governor "performance" may decide which speed to use
                  within this range.
  current CPU frequency is 2.30 GHz.
  cpufreq stats: 2.30 GHz:100.00%, 2.20 GHz:0.00%, 2.10 GHz:0.00%,
                 2.00 GHz:0.00%, 1.90 GHz:0.00%, 1.80 GHz:0.00%,
                 1.70 GHz:0.00%, 1.60 GHz:0.00%, 1.50 GHz:0.00%,
                 1.40 GHz:0.00%, 1.30 GHz:0.00%, 1.20 GHz:0.00%  (174)
  boost state support:
    Supported: no
    Active: no
```
Now you are using the performance governor with a fixed frequency of 2.3 GHz. This is the maximum possible, without Turbo Boost, on a Xeon E5-2699 v3.
To set everything up, run the following commands with administrative privileges:
```
cpupower frequency-set -g performance
cpupower frequency-set --min 2300000 --max 2300000
```
If you don’t have cpupower, install it using:

```
sudo apt-get install linux-tools-common linux-tools-`uname -r` -y
```
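After pinning the governor and frequency, it can help to verify the settings programmatically before each benchmark run. The following is a sketch of my own (not from the post), reading the standard cpufreq sysfs attributes on Linux:

```python
# Sanity-check helper (hypothetical): read the cpufreq sysfs attributes
# that `cpupower frequency-info` reports, to confirm the pin took effect.
from pathlib import Path

# Default location of the cpufreq attributes for the first CPU on Linux.
CPUFREQ = Path("/sys/devices/system/cpu/cpu0/cpufreq")


def read_setting(name, base=CPUFREQ):
    # Return the value of a cpufreq sysfs attribute, or None if unavailable
    # (e.g. inside a VM or container without cpufreq support).
    try:
        return (Path(base) / name).read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    print("governor:", read_setting("scaling_governor"))
    print("min (kHz):", read_setting("scaling_min_freq"))
    print("max (kHz):", read_setting("scaling_max_freq"))
```

On a correctly configured machine this should report `performance` and identical min/max values (2300000 kHz in the setup above).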
The power governor has a great impact on how the CPU is used. By default, the governor scales the frequency automatically to reduce power consumption. We do not want this on our system, so we disable the intel_pstate driver from GRUB. Either edit /boot/grub/grub.cfg directly (but be careful: this file is regenerated on a kernel upgrade) or create a new kernel entry in /etc/grub.d/40_custom. The boot line must contain the flag intel_pstate=disable, like this:

```
linux /boot/vmlinuz-4.4.0-78-generic.efi.signed root=UUID=86097ec1-3fa4-4d00-97c7-3bf91787be83 ro intel_pstate=disable quiet splash $vt_handoff
```
ASLR (Address Space Layout Randomization)
This setting is controversial, as you can also see in Victor Stinner’s post. When I first suggested disabling ASLR for benchmarks, it was in the context of further improving the Profile Guided Optimization support that existed in CPython at that time.
What led me to that recommendation is the fact that, on the particular hardware presented above, disabling ASLR reduces run-to-run variation to 0.4%!
On the other hand, testing this on my personal computer (which has an Intel Core i7-4710MQ), disabling ASLR led to the same issues presented by Victor. Testing on even smaller CPUs (like an Intel Atom) resulted in even more run-to-run variation, instead of reducing it.
Since this does not seem to be universally true and depends greatly on the hardware/software configuration, the takeaway is: measure with ASLR enabled, measure again with it disabled, and compare the results.
On my machine I have it disabled globally by adding the following line to /etc/sysctl.conf (apply with sudo sysctl -p):

```
kernel.randomize_va_space = 0
```
If you want to disable it at runtime:

```
sudo bash -c 'echo 0 >| /proc/sys/kernel/randomize_va_space'
```
If you want to enable it back:

```
sudo bash -c 'echo 2 >| /proc/sys/kernel/randomize_va_space'
```
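Since the advice above is to measure under both settings and compare, it is worth recording which ASLR mode each run actually used. A small helper of my own (a sketch; the value meanings follow the Linux kernel sysctl documentation) can read the setting back:

```python
# Report the current ASLR mode so benchmark results can be annotated
# with the setting they were measured under (helper is my own sketch).
ASLR_MODES = {
    "0": "disabled",
    "1": "conservative randomization",
    "2": "full randomization (default)",
}


def aslr_mode(raw):
    # Map the raw /proc/sys/kernel/randomize_va_space value to a label.
    return ASLR_MODES.get(raw.strip(), "unknown")


if __name__ == "__main__":
    with open("/proc/sys/kernel/randomize_va_space") as f:
        print("ASLR:", aslr_mode(f.read()))
```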
By Alecsandru Patrascu, alecsandru.patrascu [at] rinftech [dot] com