A common task in scientific code development is evaluating some metric (e.g. run time, memory usage) as the problem size increases, in order to estimate time and memory complexities and the scalability of an implementation, or to compare different numerical solvers and select the best one. In the following we refer to this process as code benchmarking.

#### Problem statement

To understand why benchmarking is not straightforward with the tools currently available in the Python
ecosystem, let's start with the simple case of evaluating the `numpy.ndarray.sort` function for different
array sizes. The simplest code that we could come up with is,

```
import numpy as np
import time

rng = np.random.RandomState(42)
for N in [1000, 10000, 100000]:
    X = rng.rand(N)
    t0 = time.time()
    X.sort()
    print('N=%s : %.5f s.' % (N, time.time() - t0))
```

which will produce

```
N=1000 : 0.00008 s.
N=10000 : 0.00085 s.
N=100000 : 0.00862 s.
```
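Measurements like these can already be used to estimate the time complexity. For instance, a log-log slope between the smallest and largest problem sizes (using the example timings printed above) gives the empirical scaling exponent; a minimal sketch:

```python
import math

# Estimate the empirical scaling exponent from two (size, time)
# measurements via the slope on a log-log scale. The timings are the
# example values from the output above.
sizes = [1000, 100000]
times = [0.00008, 0.00862]
exponent = (math.log(times[1]) - math.log(times[0])) / \
           (math.log(sizes[1]) - math.log(sizes[0]))
print('empirical exponent: %.2f' % exponent)
```

An exponent close to 1 here is consistent with the expected O(N log N) cost of sorting, since the log factor contributes only a small correction to the slope.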

This approach is frequently used; however, it has severe limitations.
First, `time.time` only has a 12 ms precision on Windows, and `time.clock` should be used on
that platform instead, while `time.time` remains better on Linux / Mac OS.
Additionally, garbage collection may skew the measurements.

To work around these issues the `timeit` module
was created. Using this module, the previous code can be rewritten as follows,

```
import timeit
import numpy as np

rng = np.random.RandomState(42)
for N in [1000, 10000, 100000]:
    X = rng.rand(N)

    def _func():
        X.sort()

    n_eval = 100
    dt = timeit.timeit(_func, number=n_eval) / n_eval
    print('N=%s : %.5f s.' % (N, dt))
```

In this case, the main challenge is to choose a number of evaluations large enough to avoid timer precision issues, but at the same time small enough to run the benchmark in a reasonable time.
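The standard library offers partial help here: since Python 3.6, `timeit.Timer.autorange` increases the loop count automatically until the total run takes at least 0.2 s. A minimal sketch (with a hypothetical sorting statement, unrelated to the NumPy example above):

```python
import timeit

# autorange() picks a number of loops so that the total measurement
# takes at least 0.2 seconds, working around timer precision without
# manual tuning of the evaluation count.
timer = timeit.Timer('sorted(data)', setup='data = list(range(10000))[::-1]')
number, total = timer.autorange()
per_call = total / number
print('number=%d, per call: %.6f s.' % (number, per_call))
```

This still leaves the repeat count and the result bookkeeping to the user, which is part of what neurtu automates below.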

IPython's `%timeit` magic
function is able to estimate the number
of needed evaluations automatically, and one can just run `%timeit X.sort()`.
This only works in an IPython environment, however, not in
regular Python scripts where benchmarks typically reside. In a regular script, one could do,

```
from IPython import get_ipython

ipython = get_ipython()
ipython.magic('timeit -qo X.sort()')
```

where the `-qo` flag specifies to return the evaluated time instead of printing it to stdout. However,
the default number of repeats in `%timeit` is
suitable for single measurements, but not for parametric benchmarks, where hundreds of measurements
would take too long with the default settings. Adding an IPython dependency to a project only for this functionality can
also be problematic.

Next, imagine we want to estimate the memory complexity of the `numpy.ndarray.sort` function. One could use
the memory_profiler package, but that would require understanding
its API (which is different from that of the `timeit` module). Similarly, IPython has standardized memory
measurement with the `%memit` magic function, but it also suffers from the above-mentioned limitations.
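For completeness, the standard library's `tracemalloc` module can also report peak allocations, though it only tracks memory that goes through the Python allocator, so it may miss buffers allocated natively by extension modules. A minimal sketch with a plain Python list (not the NumPy example above):

```python
import tracemalloc

# Measure the peak Python-level allocation while building and sorting
# a large list. tracemalloc only sees allocations made through the
# Python memory allocator.
tracemalloc.start()
data = list(range(100000))[::-1]
data.sort()
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print('peak: %.2f MB' % (peak / 1e6))
```

This again comes with its own API, distinct from both `timeit` and memory_profiler, which illustrates the fragmentation of the tooling.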

Finally, there is the question of how to represent the results in a format suitable for later analysis.

#### Neurtu quickstart

Neurtu was designed as a solution to the above-mentioned issues, aiming to facilitate writing multi-metric benchmarks with little effort.

Returning to the array sorting example, with neurtu we would first write a generator of cases,

```
import numpy as np
import neurtu

def cases():
    rng = np.random.RandomState(42)
    for N in [1000, 10000, 100000]:
        X = rng.rand(N)
        tags = {'N': N}
        yield neurtu.delayed(X, tags=tags).sort()
```

that yields a sequence of delayed calculations, each tagged with the parameters defining individual runs.

We can evaluate the run time with,

```
>>> df = neurtu.timeit(cases())
>>> print(df)
        wall_time
N
1000     0.000014
10000    0.000134
100000   0.001474
```

which will internally use the `timeit` module with a sufficient number of evaluations to work around the timer precision
limitations (similarly to IPython's `%timeit`). It will also display a progress bar for long-running benchmarks,
and return the results as a `pandas.DataFrame` (if pandas is installed). Plots can then be made easily with the
pandas plotting API and matplotlib:
`df.plot(marker='o', logx=True)`

By default, all evaluations are run with `repeat=1`. If more statistical confidence is required, this value can
be increased,

```
>>> neurtu.timeit(cases(), repeat=3)
       wall_time
            mean       max       std
N
1000    0.000012  0.000014  0.000002
10000   0.000116  0.000149  0.000029
100000  0.001323  0.001714  0.000339
```

In this case we get a frame with a
`pandas.MultiIndex` for
columns, where the first level represents the metric name (`wall_time`) and the second the aggregation method.
By default `neurtu.timeit` is called with the `aggregate=['mean', 'max', 'std']` methods, as supported
by the pandas aggregation API. To disable
aggregation and obtain timings for individual runs, use `aggregate=False`.
See the `neurtu.timeit` documentation for more details.
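The column structure above is standard pandas aggregation output. As an illustrative sketch (not using neurtu itself, and with made-up timing values), grouping repeated measurements and applying `agg(['mean', 'max', 'std'])` yields the same kind of MultiIndex columns:

```python
import pandas as pd

# Synthetic wall_time measurements, three repeats per problem size N,
# aggregated with the pandas API that neurtu relies on.
df = pd.DataFrame({
    'N': [1000, 1000, 1000, 10000, 10000, 10000],
    'wall_time': [0.000012, 0.000013, 0.000014,
                  0.000116, 0.000130, 0.000149],
})
agg = df.groupby('N').agg(['mean', 'max', 'std'])
print(agg)
# Columns form a MultiIndex: ('wall_time', 'mean'), ('wall_time', 'max'), ...
```

Because the result is a regular DataFrame, the full pandas toolbox (selection, plotting, export) applies directly to the benchmark output.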

To evaluate the peak memory usage, one can use the `neurtu.memit` function with the same API,

```
>>> neurtu.memit(cases(), repeat=3)
        peak_memory
               mean  max  std
N
10000           0.0  0.0  0.0
100000          0.0  0.0  0.0
1000000         0.0  0.0  0.0
```

More generally, `neurtu.Benchmark` supports a wide range of evaluation metrics,

```
>>> bench = neurtu.Benchmark(wall_time=True, cpu_time=True, peak_memory=True)
>>> bench(cases())
         cpu_time  peak_memory  wall_time
N
10000    0.000100          0.0   0.000142
100000   0.001149          0.0   0.001680
1000000  0.013677          0.0   0.018347
```

including psutil process metrics.

To find out more about neurtu, see,

- GitHub home page: github.com/symerio/neurtu
- documentation: neurtu.readthedocs.io/