Hunting Performance in Python Code – Part 1. Environment Setup

This is the first post in the series “Hunting Performance in Python Code”. In each post I’ll present some of the tools and profilers that exist for Python code and show how each of them helps you find bottlenecks in the frontend (Python scripts), in the backend (the Python interpreter), or in both.

Series index

The links below will go live once the posts are released:

  1. Setup
  2. Memory Profiling
  3. CPU Profiling – Python Scripts
  4. CPU Profiling – Python Interpreter

Setup

Before diving into benchmarking and profiling, we first need a proper environment. This means that both the machine and the operating system must be configured for the task.

As a general view, my machine has the following specs:

  • Processor: Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
  • Memory: 32GB
  • OS: Ubuntu 16.04 LTS
  • Kernel: 4.4.0-75-generic

The goal is to have reproducible results, so we must make sure that our data is not affected by background processes, operating system configuration or any hardware performance-enhancing technologies.

Let’s start with the configuration of the machine that we use for profiling.

Hardware features

First of all, disable any hardware performance features. This means disabling Intel Turbo Boost and Hyper-Threading from the BIOS/UEFI.

As presented in the official page, Turbo Boost is “a technology that automatically allows processor cores to run faster than the rated operating frequency if they’re operating below power, current, and temperature specification limits”. On the other hand, Hyper Threading is “a technology which uses processor resources more efficiently, enabling multiple threads to run on each core”, as stated here.

Good stuff that we paid for and we really want them in production. Then why is it bad to have them enabled when profiling/benchmarking? Because we don’t get reliable and reproducible results, which translates into run to run variation. Let’s see this in a small example, called primes.py, intentionally written poorly 🙂
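A benchmark of this kind can be sketched as follows. This is a minimal illustration, assuming Python 3, and not necessarily the listing from the repository: the names is_prime, count_primes and benchmark, the limit of 3000 and the default of 5 runs are all choices made for this sketch. It times a deliberately naive prime counter several times and reports the total duration, the mean and the standard deviation in the same shape as the output shown later.

```python
import statistics
import time


def is_prime(n):
    """Naive trial division: deliberately slow, tries every candidate divisor."""
    if n < 2:
        return False
    for divisor in range(2, n):
        if n % divisor == 0:
            return False
    return True


def count_primes(limit):
    """Count the primes strictly below `limit`."""
    return sum(1 for n in range(limit) if is_prime(n))


def benchmark(func, arg, runs=5):
    """Call func(arg) `runs` times; return (total, mean, stdev) in seconds."""
    durations = []
    start = time.perf_counter()
    for _ in range(runs):
        run_start = time.perf_counter()
        func(arg)
        durations.append(time.perf_counter() - run_start)
    total = time.perf_counter() - start
    return total, statistics.mean(durations), statistics.stdev(durations)


if __name__ == "__main__":
    # The limit and the number of runs are arbitrary values for this sketch.
    total, mean, stdev = benchmark(count_primes, 3000)
    print("Benchmark duration: {} seconds".format(total))
    print("Mean duration: {} seconds".format(mean))
    print("Standard deviation: {} ({} %)".format(stdev, stdev / mean * 100))
```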

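The code is also available on GitHub here. If you are on Python 2, you will need the statistics backport as a dependency (the module has been part of the standard library since Python 3.4):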

pip install statistics

Let’s run it in a system that has Turbo Boost and Hyper Threading enabled:

python primes.py
Benchmark duration: 1.0644240379333496 seconds
Mean duration: 0.2128755569458008 seconds
Standard deviation: 0.032928838418120374 (15.468585914964498 %)

Now on the same system, but with Turbo Boost and Hyper Threading disabled:

python primes.py
Benchmark duration: 1.2374498844146729 seconds
Mean duration: 0.12374367713928222 seconds
Standard deviation: 0.000684464852339824 (0.553131172568 %)

Observe the standard deviation in the first case – 15%. This is a HUGE value! Suppose you make an optimization that brings a 6% speedup: how will you be able to distinguish between run-to-run variation and the effect of your change?

Instead, in the second case, the variation is reduced to approx. 0.6%. Your shiny new optimization will be visible!
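The percentages in the two outputs are the standard deviation expressed relative to the mean. The sketch below recomputes them from the numbers printed above and compares them with the hypothetical 6% speedup. The helper names are made up for this illustration, and the comparison in speedup_is_visible is only a crude rule of thumb, not a proper statistical test.

```python
def relative_stdev(mean, stdev):
    """Return the standard deviation as a percentage of the mean."""
    return stdev / mean * 100.0


def speedup_is_visible(speedup_percent, noise_percent):
    """Crude heuristic (an assumption of this sketch): a speedup stands out
    only if it is larger than the run-to-run noise."""
    return speedup_percent > noise_percent


if __name__ == "__main__":
    # Mean and standard deviation copied from the two runs above.
    noisy = relative_stdev(0.2128755569458008, 0.032928838418120374)
    quiet = relative_stdev(0.12374367713928222, 0.000684464852339824)
    for label, noise in (("features enabled", noisy), ("features disabled", quiet)):
        print("{}: noise {:.2f} %, 6% speedup visible: {}".format(
            label, noise, speedup_is_visible(6.0, noise)))
```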

CPU power savings

Disable any CPU power savings and use a fixed CPU frequency. This can be done by switching the Linux CPU frequency scaling driver from intel_pstate to acpi_cpufreq.

The intel_pstate driver implements a scaling driver with an internal governor for Intel Core (Sandy Bridge and newer) processors. The acpi_cpufreq driver utilizes the ACPI Processor Performance States.

Let’s check it out first!

$ cpupower frequency-info
analyzing CPU 0:
 driver: intel_pstate
 CPUs which run at the same hardware frequency: 0
 CPUs which need to have their frequency coordinated by software: 0
 maximum transition latency: 0.97 ms.
 hardware limits: 1.20 GHz - 3.60 GHz
 available cpufreq governors: performance, powersave
 current policy: frequency should be within 1.20 GHz and 3.60 GHz.
                 The governor "powersave" may decide which speed to use
                 within this range.
 current CPU frequency is 1.20 GHz.
 boost state support:
     Supported: yes
     Active: yes

You can see that the active governor is powersave and that the CPU frequency scales between 1.20 GHz and 3.60 GHz. This is good for your personal computer or any other day-to-day usage, but it hurts the results when benchmarking.

What are the possible values for the governor? If we browse the documentation we see that we can use the following:

  • performance – run the CPU at the maximum frequency.
  • powersave – run the CPU at the minimum frequency.
  • userspace – run the CPU at user specified frequencies.
  • ondemand – scales the frequency dynamically according to current load. Jumps to the highest frequency and then possibly backs off as the idle time increases.
  • conservative – scales the frequency dynamically according to current load. Scales the frequency more gradually than ondemand.

What we want to use is the performance governor and set the frequency at the maximum one supported by the CPU. Something like this:

$ cpupower frequency-info
analyzing CPU 0:
 driver: acpi-cpufreq
 CPUs which run at the same hardware frequency: 0
 CPUs which need to have their frequency coordinated by software: 0
 maximum transition latency: 10.0 us.
 hardware limits: 1.20 GHz - 2.30 GHz
 available frequency steps: 2.30 GHz, 2.20 GHz, 2.10 GHz, 2.00 GHz, 1.90 GHz, 1.80 GHz, 1.70 GHz, 1.60 GHz, 1.50 GHz, 1.40 GHz, 1.30 GHz, 1.20 GHz
 available cpufreq governors: conservative, ondemand, userspace, powersave, performance
 current policy: frequency should be within 2.30 GHz and 2.30 GHz.
                 The governor "performance" may decide which speed to use
                 within this range.
 current CPU frequency is 2.30 GHz.
 cpufreq stats: 2.30 GHz:100.00%, 2.20 GHz:0.00%, 2.10 GHz:0.00%, 2.00 GHz:0.00%, 1.90 GHz:0.00%, 1.80 GHz:0.00%, 1.70 GHz:0.00%, 1.60 GHz:0.00%, 1.50 GHz:0.00%, 1.40 GHz:0.00%, 1.30 GHz:0.00%, 1.20 GHz:0.00% (174)
 boost state support:
     Supported: no
     Active: no

Now you are going to use the performance governor and have a fixed frequency of 2.3 GHz. This value is the maximum possible, without Turbo Boost, that can be used on a Xeon E5-2699 v3.

To set everything up, run the following commands with administrative privileges:

cpupower frequency-set -g performance
cpupower frequency-set --min 2300000 --max 2300000

If you don’t have cpupower, install it using:

sudo apt-get install linux-tools-common linux-headers-`uname -r` -y
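To confirm from inside a benchmark script that the frequency really is pinned, the cpufreq files that the kernel exposes under sysfs can be read directly. The following is a minimal sketch, assuming a Linux system with cpufreq support; it only inspects cpu0, and the function names are made up for this example.

```python
import os

# Standard cpufreq sysfs directory for the first CPU.
CPUFREQ_DIR = "/sys/devices/system/cpu/cpu0/cpufreq"


def read_cpufreq(name, base=CPUFREQ_DIR):
    """Return the stripped contents of one cpufreq sysfs file, or None if unreadable."""
    try:
        with open(os.path.join(base, name)) as handle:
            return handle.read().strip()
    except OSError:
        return None


def frequency_is_pinned(base=CPUFREQ_DIR):
    """True when the governor is 'performance' and the min and max frequencies match."""
    governor = read_cpufreq("scaling_governor", base)
    low = read_cpufreq("scaling_min_freq", base)
    high = read_cpufreq("scaling_max_freq", base)
    return governor == "performance" and low is not None and low == high


if __name__ == "__main__":
    print("driver:", read_cpufreq("scaling_driver"))
    print("governor:", read_cpufreq("scaling_governor"))
    print("pinned:", frequency_is_pinned())
```

Passing the directory as a parameter keeps the check easy to try against a copy of the files; extending it to every CPU would mean looping over the cpuN directories.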

The power governor has a great impact on how a CPU is used. By default, the governor is set to automatically scale the frequency to reduce power consumption. We do not want this on our system, so we disable it from GRUB. Either edit /boot/grub/grub.cfg directly (keep in mind that the change will be lost on a kernel upgrade, because the file is regenerated) or create a new kernel entry in /etc/grub.d/40_custom. The boot line must contain the flag intel_pstate=disable, like this:

linux   /boot/vmlinuz-4.4.0-78-generic.efi.signed root=UUID=86097ec1-3fa4-4d00-97c7-3bf91787be83 ro intel_pstate=disable quiet splash $vt_handoff
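After rebooting, one way to check that the flag actually reached the kernel is to look at /proc/cmdline, which holds the command line the running kernel was booted with. A minimal sketch follows; the helper name kernel_has_flag is made up for this example.

```python
def kernel_has_flag(flag, cmdline_path="/proc/cmdline"):
    """Return True if `flag` appears as a whole parameter on the kernel command line."""
    with open(cmdline_path) as handle:
        # Parameters are separated by whitespace; match whole tokens only.
        return flag in handle.read().split()
```

On a machine booted with the entry above, kernel_has_flag("intel_pstate=disable") should return True.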

ASLR (Address Space Layout Randomization)

This setting is controversial, as you can also see in Victor Stinner’s post. When I first suggested disabling ASLR when doing benchmarks, it was in the context of further improving the support for Profile Guided Optimizations existing in CPython at that time.

What led me to state that is the fact that, on the particular hardware presented above, disabling ASLR reduces run-to-run variation to 0.4%!

On the other hand, when testing this on my personal computer (which has an Intel Core i7 4710MQ), disabling ASLR led to the same issues presented by Victor. Testing on even smaller CPUs (like an Intel Atom) resulted in even more run-to-run variation instead of reducing it.

Since this does not seem to be universally true and depends greatly on the hardware/software configuration, my recommendation is to leave ASLR enabled and measure, then disable it and measure again, and compare the results.

On my machine I have it disabled globally by adding the following line to /etc/sysctl.conf and applying it with sudo sysctl -p:

kernel.randomize_va_space = 0

If you want to disable it at runtime:

sudo bash -c 'echo 0 >| /proc/sys/kernel/randomize_va_space'

If you want to enable it back:

sudo bash -c 'echo 2 >| /proc/sys/kernel/randomize_va_space'
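Because the effect of ASLR depends on the machine, it helps to record its state alongside every measurement. The sketch below reads the same /proc file that the commands above write to; the function name and the wording of the descriptions are choices made for this example.

```python
# Meaning of the values accepted by kernel.randomize_va_space.
ASLR_MODES = {
    0: "disabled",
    1: "partial: stack, mmap base and VDSO page are randomized",
    2: "full: additionally randomizes the heap",
}


def aslr_mode(path="/proc/sys/kernel/randomize_va_space"):
    """Return the current ASLR setting as an integer (0, 1 or 2)."""
    with open(path) as handle:
        return int(handle.read().strip())


def describe_aslr(mode):
    """Return a human-readable description for an ASLR setting."""
    return ASLR_MODES.get(mode, "unknown value {}".format(mode))
```

Printing describe_aslr(aslr_mode()) at the start of a benchmark run makes it easy to tell later which configuration produced which numbers.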

By Alecsandru Patrascu, alecsandru.patrascu [at] rinftech [dot] com
