Neurtu: parametric benchmarks in Python


A common task in scientific code development is to evaluate some metric (e.g. run time, memory usage) as the problem size increases, in order to estimate time and memory complexities and the scalability of an implementation, or alternatively to compare different numerical solvers and select the best one. In the following we will refer to this process as code benchmarking.

Problem statement

To understand why benchmarking is not straightforward with current tools available in the Python ecosystem, let's start with the simple case of evaluating the numpy.ndarray.sort function for different array sizes. The simplest code that we could come up with is,

import numpy as np
import time

rng = np.random.RandomState(42)

for N in [1000, 10000, 100000]:
    X = rng.rand(N)
    t0 = time.time()
    X.sort()
    print('N=%s : %.5f s.' % (N, time.time() - t0))

which will produce

N=1000 : 0.00008 s.
N=10000 : 0.00085 s.
N=100000 : 0.00862 s.

This approach is frequently used; however, it has severe limitations. First, time.time only has a resolution of roughly 16 ms on Windows, and time.clock should be used on that platform instead, while time.time remains the better choice on Linux / Mac OS. Additionally, garbage collection may skew the measurements.
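As an illustration (this snippet is not part of the original example), a somewhat more careful manual measurement could rely on time.perf_counter, which provides a high-resolution monotonic timer on all platforms, and temporarily disable the garbage collector,

import gc
import time

import numpy as np

rng = np.random.RandomState(42)
X = rng.rand(100000)

gc.disable()  # avoid garbage collection pauses during the measurement
try:
    t0 = time.perf_counter()  # high-resolution, monotonic timer
    X.sort()
    dt = time.perf_counter() - t0
finally:
    gc.enable()

print('N=100000 : %.5f s.' % dt)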

To work around these issues the timeit module was created. Using this module, the previous code can be rewritten as follows,

import timeit

for N in [1000, 10000, 100000]:
    X = rng.rand(N)

    def _func():
        X.sort()

    n_eval = 100
    dt = timeit.timeit(_func, number=n_eval) / n_eval
    print('N=%s : %.5f s.' % (N, dt))

In this case, the main challenge is to choose a number of evaluations large enough to avoid timer precision issues, but at the same time small enough to run the benchmark in a reasonable time.
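The standard library can actually do this calibration for us: timeit.Timer.autorange (available since Python 3.6) keeps increasing the number of loops until the total measured time reaches at least 0.2 seconds. A possible sketch,

import timeit

import numpy as np

rng = np.random.RandomState(42)

for N in [1000, 10000, 100000]:
    X = rng.rand(N)
    timer = timeit.Timer(X.sort)
    # autorange returns (number of loops, total time for that many loops)
    n_eval, total = timer.autorange()
    print('N=%s : %.5f s.' % (N, total / n_eval))

This removes the guesswork for a single measurement, but the bookkeeping of parameters and the collection of results is still left to the user.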

IPython's %timeit magic function similarly estimates the required number of evaluations automatically, so one can just run %timeit X.sort(). However, this only works in an IPython environment, not in the regular Python scripts where benchmarks typically reside. In a regular script, one could do,

from IPython import get_ipython
ipython = get_ipython()

ipython.magic('timeit -qo X.sort()')

where the -q flag suppresses printing to stdout and -o returns the measured timings instead of discarding them. However, the default number of repeats in %timeit is suitable for a single measurement, not for parametric benchmarks where hundreds of measurements with the default settings would take too long. Adding an IPython dependency to a project only for this functionality can also be problematic.
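For completeness, the object returned with -o can be inspected programmatically; the sketch below assumes an IPython session and that the returned TimeitResult exposes average and best attributes, as in recent IPython versions,

from IPython import get_ipython

ipython = get_ipython()  # None outside of an IPython session

result = ipython.magic('timeit -qo X.sort()')
# the TimeitResult object holds the individual timings and summary statistics
print(result.average, result.best)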

Next, imagine we want to estimate the memory complexity of the numpy.ndarray.sort function. One could use the memory_profiler package, but that would require understanding its API (which is different from that of the timeit module). Similarly, IPython has standardized memory measurements with the %memit magic function, but it suffers from the same limitations as %timeit above.
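For reference, a minimal measurement with memory_profiler could look as follows (a sketch assuming the memory_profiler package is installed; memory_usage samples the process memory in MiB while the function runs),

import numpy as np
from memory_profiler import memory_usage

rng = np.random.RandomState(42)

def run(N):
    X = rng.rand(N)
    X.sort()

for N in [1000, 10000, 100000]:
    # run(N) is executed and the process memory is sampled every 10 ms
    usage = memory_usage((run, (N,), {}), interval=0.01)
    print('N=%s : peak %.1f MiB' % (N, max(usage)))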

Finally, there is the question of how to represent the results in a format suitable for later analysis.

Neurtu quickstart

Neurtu was designed as a solution to the above-mentioned issues, aiming to make it easy to write multi-metric benchmarks with little effort.

Returning to the array sorting example, with neurtu we would first write a generator of cases,

import numpy as np
import neurtu

def cases():
    rng = np.random.RandomState(42)

    for N in [1000, 10000, 100000]:
        X = rng.rand(N)
        tags = {'N' : N}
        yield neurtu.delayed(X, tags=tags).sort()

that yields a sequence of delayed calculations, each tagged with the parameters defining individual runs.

We can evaluate the run time with,

>>> df = neurtu.timeit(cases())
>>> print(df)
        wall_time
N
1000     0.000014
10000    0.000134
100000   0.001474

which will internally use the timeit module with a sufficient number of evaluations to work around the timer precision limitations (similarly to IPython's %timeit). It will also display a progress bar for long-running benchmarks, and return the results as a pandas.DataFrame (if pandas is installed). Plots can then be made easily with the pandas plotting API and matplotlib: df.plot(marker='o', logx=True)
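For instance, a plot of the measured wall time as a function of N can be obtained with a few lines (a sketch reusing the cases() generator defined above, and assuming matplotlib is installed),

import matplotlib.pyplot as plt

df = neurtu.timeit(cases())

# log-log scale makes it easier to read off the time complexity
ax = df.plot(marker='o', logx=True, logy=True)
ax.set_xlabel('N')
ax.set_ylabel('wall time (s)')
plt.show()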

Neurtu benchmark example

By default, all evaluations are run with repeat=1. If more statistical confidence is required, this value can be increased,

>>> neurtu.timeit(cases(), repeat=3)
       wall_time
            mean       max       std
N
1000    0.000012  0.000014  0.000002
10000   0.000116  0.000149  0.000029
100000  0.001323  0.001714  0.000339

In this case we get a frame with a pandas.MultiIndex for columns, where the first level represents the metric name (wall_time) and the second the aggregation method. By default neurtu.timeit is called with aggregate=['mean', 'max', 'std'], as supported by the pandas aggregation API. To disable aggregation and obtain timings for individual runs, use aggregate=False. See the neurtu.timeit documentation for more details.
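As a quick illustration, any aggregation names understood by pandas should work here; for example, the following sketch would compute the median and minimum run time over 5 repeats, or keep the raw per-run timings for custom post-processing,

# custom aggregation functions, as understood by the pandas aggregation API
df = neurtu.timeit(cases(), repeat=5, aggregate=['median', 'min'])
print(df)

# or keep the individual timings and post-process them with pandas directly
df_raw = neurtu.timeit(cases(), repeat=5, aggregate=False)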

To evaluate the peak memory usage, one can use the neurtu.memit function with the same API,

>>> neurtu.memit(cases(), repeat=3)
        peak_memory
               mean  max  std
N
10000           0.0  0.0  0.0
100000          0.0  0.0  0.0
1000000         0.0  0.0  0.0

More generally, neurtu.Benchmark supports a number of evaluation metrics,

>>> bench = neurtu.Benchmark(wall_time=True, cpu_time=True, peak_memory=True)
>>> bench(cases)
         cpu_time  peak_memory  wall_time
N
10000    0.000100          0.0   0.000142
100000   0.001149          0.0   0.001680
1000000  0.013677          0.0   0.018347

including psutil process metrics.

To find out more about neurtu, see the project documentation.