# Neurtu: parametric benchmarks in Python

A common task in scientific code development consists in evaluating some metric (e.g. run time, memory usage) as the problem size increases to estimate time and memory complexities and implementation scalability. Or alternatively with different numerical solvers, to select the best one. In the following we will refer to this process as code benchmarking.

#### Problem statement

To understand why benchmarking is not straightforward with current tools available in the Python ecosystem, let's start with the simple case of evaluating the `numpy.ndarray.sort` function for different array sizes. The simplest code that we could come up with is,

```python
import time

import numpy as np

rng = np.random.RandomState(42)

for N in [1000, 10000, 100000]:
    X = rng.rand(N)
    t0 = time.time()
    X.sort()
    print('N=%s : %.5f s.' % (N, time.time() - t0))
```

which will produce

```
N=1000 : 0.00008 s.
N=10000 : 0.00085 s.
N=100000 : 0.00862 s.
```

This approach is frequently used; however, it has severe limitations. First, `time.time` only has a precision of roughly a dozen milliseconds on Windows, where `time.clock` should be used instead, while `time.time` remains the better choice on Linux / macOS. Additionally, garbage collection may skew the measurements.
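On Python 3, a portable way around the timer precision problem is `time.perf_counter`, which always uses the highest-resolution clock available on the platform; the garbage collector can also be disabled around the timed region. A minimal sketch (the `measure` helper below is not part of any library):

```python
import gc
import time

def measure(func, n_eval=100):
    """Average wall time of func() over n_eval calls, with GC disabled."""
    gc_was_enabled = gc.isenabled()
    gc.disable()  # avoid GC pauses skewing the measurement
    try:
        t0 = time.perf_counter()
        for _ in range(n_eval):
            func()
        return (time.perf_counter() - t0) / n_eval
    finally:
        if gc_was_enabled:
            gc.enable()

dt = measure(lambda: sorted(range(1000)))
print('%.7f s.' % dt)
```

This still leaves the question of how many evaluations to run, which is what `timeit` addresses.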

To work around these issues the `timeit` module was created. Using this module, the previous code can be rewritten as follows,

```python
import timeit

import numpy as np

rng = np.random.RandomState(42)

for N in [1000, 10000, 100000]:
    X = rng.rand(N)

    def _func():
        X.sort()

    n_eval = 100
    dt = timeit.timeit(_func, number=n_eval) / n_eval
    print('N=%s : %.5f s.' % (N, dt))
```

In this case, the main challenge is to choose a number of evaluations large enough to avoid timer precision issues, but at the same time small enough to run the benchmark in a reasonable time.
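The standard library can also pick this number automatically: `timeit.Timer.autorange` (available since Python 3.6) keeps increasing the loop count until the total run time reaches about 0.2 s. A short sketch:

```python
import timeit

# autorange() returns (number_of_loops, total_time_taken) once the
# total run time is long enough for a reliable measurement.
timer = timeit.Timer('sorted(x)', setup='x = list(range(10000))')
number, total_time = timer.autorange()
print('%d loops, %.6f s per loop' % (number, total_time / number))
```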

IPython's `%timeit` magic function is able to estimate the number of needed evaluations automatically, and one can just run `%timeit X.sort()`. This only works in an IPython environment, however; not in regular Python scripts where benchmarks typically reside. In a regular script, one could do,

```python
from IPython import get_ipython

ipython = get_ipython()
ipython.magic('timeit -qo X.sort()')
```

where the `-qo` flags tell `%timeit` to run quietly and to return the measured time instead of printing it to stdout. However, the default number of repeats in `%timeit` is suitable for single measurements, not for parametric benchmarks, where hundreds of measurements with the default settings would take too long. Adding an IPython dependency to a project only for this functionality can also be problematic.

Next, imagine we want to estimate the memory complexity of the `numpy.ndarray.sort` function. One could use the `memory_profiler` package, but that would require understanding its API (which is different from that of the `timeit` module). Similarly, IPython has standardized memory measurement with the `%memit` magic function, but it suffers from the same limitations mentioned above.
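For a rough idea of what such a measurement involves, the standard-library `tracemalloc` module can be used directly; recent NumPy versions report their allocations to it, so the temporary copy made by `np.sort` shows up in the traced peak. A sketch, not a replacement for a proper memory profiler:

```python
import tracemalloc

import numpy as np

X = np.random.RandomState(42).rand(100000)

tracemalloc.start()
Y = np.sort(X)  # allocates a sorted copy of X (~800 kB of float64)
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print('peak: %.1f KiB' % (peak / 1024))
```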

Finally, there is the question of how to represent the results in a format suitable for later analysis.
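One common ad hoc answer is to collect each measurement as a dict of parameters and metrics, then load the list into pandas; the column names below are purely illustrative:

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.RandomState(42)

records = []
for N in [1000, 10000]:
    X = rng.rand(N)
    n_eval = 100
    dt = timeit.timeit(X.sort, number=n_eval) / n_eval
    records.append({'N': N, 'wall_time': dt})

# A DataFrame indexed by the benchmark parameters is easy to
# analyse, plot, or save to CSV later.
df = pd.DataFrame(records).set_index('N')
print(df)
```

Writing this boilerplate for every benchmark quickly becomes tedious, which motivates the design below.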

#### Neurtu quickstart

Neurtu was designed as a solution to the above-mentioned issues, aiming to make it easy to write multi-metric benchmarks with little effort.

If we take the array sorting example above, with neurtu we would first write a generator of cases,

```python
import numpy as np
import neurtu

def cases():
    rng = np.random.RandomState(42)

    for N in [1000, 10000, 100000]:
        X = rng.rand(N)
        tags = {'N': N}
        yield neurtu.delayed(X, tags=tags).sort()
```

that yields a sequence of delayed calculations, each tagged with the parameters defining individual runs.

We can evaluate the run time with,

```
>>> df = neurtu.timeit(cases())
>>> print(df)
        wall_time
N
1000     0.000014
10000    0.000134
100000   0.001474
```

which internally uses the `timeit` module with a sufficient number of evaluations to work around timer precision limitations (similarly to IPython's `%timeit`). It also displays a progress bar for long-running benchmarks, and returns the results as a `pandas.DataFrame` (if pandas is installed). Plots can then be made easily with the pandas plotting API and matplotlib, e.g. `df.plot(marker='o', logx=True)`. By default, all evaluations are run with `repeat=1`. If more statistical confidence is required, this value can be increased,

```
>>> neurtu.timeit(cases(), repeat=3)
       wall_time
            mean       max       std
N
1000    0.000012  0.000014  0.000002
10000   0.000116  0.000149  0.000029
100000  0.001323  0.001714  0.000339
```

In this case we get a frame with a `pandas.MultiIndex` for columns, where the first level represents the metric name (`wall_time`) and the second the aggregation method. By default `neurtu.timeit` is called with `aggregate=['mean', 'max', 'std']`, as supported by the pandas aggregation API. To disable aggregation and obtain timings for individual runs, use `aggregate=False`. See the `neurtu.timeit` documentation for more details.
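The underlying pandas behaviour can be seen in isolation: aggregating one column with several functions produces exactly this kind of `MultiIndex` on the columns. A toy example (the numbers are made up, not neurtu output):

```python
import pandas as pd

df = pd.DataFrame({
    'N': [1000, 1000, 10000, 10000],
    'wall_time': [0.8e-5, 1.2e-5, 9.5e-5, 10.5e-5],
})

# Aggregating with a list of functions yields two column levels:
# ('wall_time', 'mean'), ('wall_time', 'max'), ('wall_time', 'std').
agg = df.groupby('N').agg({'wall_time': ['mean', 'max', 'std']})
print(agg)
```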

To evaluate the peak memory usage, one can use the `neurtu.memit` function with the same API,

```
>>> neurtu.memit(cases(), repeat=3)
        peak_memory
               mean  max  std
N
10000           0.0  0.0  0.0
100000          0.0  0.0  0.0
1000000         0.0  0.0  0.0
```

More generally, `neurtu.Benchmark` supports a number of evaluation metrics,

```
>>> bench = neurtu.Benchmark(wall_time=True, cpu_time=True, peak_memory=True)
>>> bench(cases())
         cpu_time  peak_memory  wall_time
N
10000    0.000100          0.0   0.000142
100000   0.001149          0.0   0.001680
1000000  0.013677          0.0   0.018347
```

including psutil process metrics.

To find out more about neurtu, see its documentation.