A common task in scientific code development is evaluating some metric (e.g. run time, memory usage) as the problem size increases, in order to estimate time and memory complexities and the scalability of an implementation, or to compare different numerical solvers and select the best one. In the following we refer to this process as code benchmarking.

#### Problem statement

To understand why benchmarking is not straightforward with the tools currently available in the Python
ecosystem, let's start with the simple case of evaluating the `numpy.ndarray.sort` function for different
array sizes. The simplest code that we could come up with is,

```
import numpy as np
import time

rng = np.random.RandomState(42)
for N in [1000, 10000, 100000]:
    X = rng.rand(N)
    t0 = time.time()
    X.sort()
    print('N=%s : %.5f s.' % (N, time.time() - t0))
```

which will produce

```
N=1000 : 0.00008 s.
N=10000 : 0.00085 s.
N=100000 : 0.00862 s.
```
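Measurements like these can already be used to estimate the time complexity. For instance, a log-log slope between the smallest and largest problem sizes (using the example timings printed above) gives the empirical scaling exponent; a minimal sketch:

```python
import math

# Estimate the empirical scaling exponent from two (size, time)
# measurements via the slope on a log-log scale. The timings are the
# example values from the output above.
sizes = [1000, 100000]
times = [0.00008, 0.00862]
exponent = (math.log(times[1]) - math.log(times[0])) / \
           (math.log(sizes[1]) - math.log(sizes[0]))
print('empirical exponent: %.2f' % exponent)
```

An exponent close to 1 here is consistent with the expected O(N log N) cost of sorting, since the log factor contributes only a small correction to the slope.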

This approach is frequently used; however, it has severe limitations.
First, `time.time` only has a 12 ms precision on Windows, and `time.clock` should be used on
that platform instead, while `time.time` remains better on Linux / Mac OS.
Additionally, garbage collection may skew the measurements.

To work around these issues the `timeit` module
was created. Using this module, the previous code can be rewritten as follows,

```
import timeit
import numpy as np

rng = np.random.RandomState(42)
for N in [1000, 10000, 100000]:
    X = rng.rand(N)

    def _func():
        X.sort()

    n_eval = 100
    dt = timeit.timeit(_func, number=n_eval) / n_eval
    print('N=%s : %.5f s.' % (N, dt))
```

In this case, the main challenge is to choose a number of evaluations large enough to avoid timer precision issues, but at the same time small enough to run the benchmark in a reasonable time.
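The standard library offers partial help here: since Python 3.6, `timeit.Timer.autorange` increases the loop count automatically until the total run takes at least 0.2 s. A minimal sketch (with a hypothetical sorting statement, unrelated to the NumPy example above):

```python
import timeit

# autorange() picks a number of loops so that the total measurement
# takes at least 0.2 seconds, working around timer precision without
# manual tuning of the evaluation count.
timer = timeit.Timer('sorted(data)', setup='data = list(range(10000))[::-1]')
number, total = timer.autorange()
per_call = total / number
print('number=%d, per call: %.6f s.' % (number, per_call))
```

This still leaves the repeat count and the result bookkeeping to the user, which is part of what neurtu automates below.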

IPython's `%timeit` magic
function is able to estimate the number
of needed evaluations automatically, and one can just run `%timeit X.sort()`.
This only works in an IPython environment, however, not in
regular Python scripts where benchmarks typically reside. In a regular script, one could do,

```
from IPython import get_ipython

ipython = get_ipython()
ipython.magic('timeit -qo X.sort()')
```

where the `-qo` flag specifies to return the evaluated time instead of printing it to stdout. However,
the default number of repeats in `%timeit` is
suitable for single measurements, but not for parametric benchmarks, where hundreds of measurements
would take too long with the default settings. Adding an IPython dependency to a project only for this functionality can
also be problematic.

Next, imagine we want to estimate the memory complexity of the `numpy.ndarray.sort` function. One could use
the memory_profiler package, but that would require understanding
its API (which is different from that of the `timeit` module). Similarly, IPython has standardized memory
measurement with the `%memit` magic function, but it also suffers from the above-mentioned limitations.
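For completeness, the standard library's `tracemalloc` module can also report peak allocations, though it only tracks memory that goes through the Python allocator, so it may miss buffers allocated natively by extension modules. A minimal sketch with a plain Python list (not the NumPy example above):

```python
import tracemalloc

# Measure the peak Python-level allocation while building and sorting
# a large list. tracemalloc only sees allocations made through the
# Python memory allocator.
tracemalloc.start()
data = list(range(100000))[::-1]
data.sort()
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print('peak: %.2f MB' % (peak / 1e6))
```

This again comes with its own API, distinct from both `timeit` and memory_profiler, which illustrates the fragmentation of the tooling.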

Finally, there is the question of how to represent the results in a format suitable for later analysis.

#### Neurtu quickstart

Neurtu was designed as a solution to the above-mentioned issues, aiming to facilitate writing multi-metric benchmarks with little effort.

Returning to the array sorting example, with neurtu we would first write a generator of cases,

```
import numpy as np
import neurtu

def cases():
    rng = np.random.RandomState(42)
    for N in [1000, 10000, 100000]:
        X = rng.rand(N)
        tags = {'N': N}
        yield neurtu.delayed(X, tags=tags).sort()
```

that yields a sequence of delayed calculations, each tagged with the parameters defining individual runs.

We can evaluate the run time with,

```
>>> df = neurtu.timeit(cases())
>>> print(df)
        wall_time
N
1000     0.000014
10000    0.000134
100000   0.001474
```

which will internally use the `timeit` module with a sufficient number of evaluations to work around the timer precision
limitations (similarly to IPython's `%timeit`). It will also display a progress bar for long-running benchmarks,
and return the results as a `pandas.DataFrame` (if pandas is installed). Plots can then be made easily with the
pandas plotting API and matplotlib:
`df.plot(marker='o', logx=True)`

By default, all evaluations are run with `repeat=1`. If more statistical confidence is required, this value can
be increased,

```
>>> neurtu.timeit(cases(), repeat=3)
       wall_time
            mean       max       std
N
1000    0.000012  0.000014  0.000002
10000   0.000116  0.000149  0.000029
100000  0.001323  0.001714  0.000339
```

In this case we get a frame with a
`pandas.MultiIndex` for
columns, where the first level represents the metric name (`wall_time`) and the second the aggregation method.
By default `neurtu.timeit` is called with the `aggregate=['mean', 'max', 'std']` methods, as supported
by the pandas aggregation API. To disable
aggregation and obtain timings for individual runs, use `aggregate=False`.
See the `neurtu.timeit` documentation for more details.
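The column structure above is standard pandas aggregation output. As an illustrative sketch (not using neurtu itself, and with made-up timing values), grouping repeated measurements and applying `agg(['mean', 'max', 'std'])` yields the same kind of MultiIndex columns:

```python
import pandas as pd

# Synthetic wall_time measurements, three repeats per problem size N,
# aggregated with the pandas API that neurtu relies on.
df = pd.DataFrame({
    'N': [1000, 1000, 1000, 10000, 10000, 10000],
    'wall_time': [0.000012, 0.000013, 0.000014,
                  0.000116, 0.000130, 0.000149],
})
agg = df.groupby('N').agg(['mean', 'max', 'std'])
print(agg)
# Columns form a MultiIndex: ('wall_time', 'mean'), ('wall_time', 'max'), ...
```

Because the result is a regular DataFrame, the full pandas toolbox (selection, plotting, export) applies directly to the benchmark output.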

To evaluate the peak memory usage, one can use the `neurtu.memit` function with the same API,

```
>>> neurtu.memit(cases(), repeat=3)
        peak_memory
               mean  max  std
N
10000           0.0  0.0  0.0
100000          0.0  0.0  0.0
1000000         0.0  0.0  0.0
```

More generally, `neurtu.Benchmark` supports a wide range of evaluation metrics,

```
>>> bench = neurtu.Benchmark(wall_time=True, cpu_time=True, peak_memory=True)
>>> bench(cases())
         cpu_time  peak_memory  wall_time
N
10000    0.000100          0.0   0.000142
100000   0.001149          0.0   0.001680
1000000  0.013677          0.0   0.018347
```

including psutil process metrics.

To find out more about neurtu, see,

- GitHub home page: github.com/symerio/neurtu
- documentation: neurtu.readthedocs.io/