
I would like to measure the execution time of various code snippets in Python using Jupyter notebooks. Jupyter notebooks offer the %timeit line magic and the %%timeit cell magic to measure the execution time of a statement or a cell.

The Jupyter documentation states:

-r<R>: number of repeats <R>, each consisting of <N> loops, and take the best result. Default: 7

This indicates that the best of the repeats should be reported, which is in line with general profiling best practice. The timeit.repeat documentation states this explicitly:

Note: It’s tempting to calculate mean and standard deviation from the result vector and report these. However, this is not very useful. In a typical case, the lowest value gives a lower bound for how fast your machine can run the given code snippet; higher values in the result vector are typically not caused by variability in Python’s speed, but by other processes interfering with your timing accuracy. So the min() of the result is probably the only number you should be interested in. After that, you should look at the entire vector and apply common sense rather than statistics.
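
For example, a minimal sketch of following that advice with timeit.repeat directly (the statement and the repeat/loop counts here are just illustrative):

import timeit

# 7 timing runs, each executing the statement 1_000_000 times
times = timeit.repeat("u is None", setup="u = None", repeat=7, number=1_000_000)

# per the timeit docs: report the minimum, not the mean
best_per_loop = min(times) / 1_000_000
print(f"{best_per_loop * 1e9:.1f} ns per loop (best of {len(times)} runs)")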

However, when we execute a cell in Jupyter with the %timeit magic, it reports mean ± std:

In [1]: %timeit pass
8.26 ns ± 0.12 ns per loop (mean ± std. dev. of 7 runs, 100000000 loops each)

In [2]: u = None

In [3]: %timeit u is None
29.9 ns ± 0.643 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

There is a significant difference between taking the mean of multiple runs and taking the min:

from functools import partial
from statistics import mean
import timeit

# for each candidate implementation, record both the min and the mean of the repeats
for approach in approaches:
    function = partial(approach, *data)
    times = timeit.Timer(function).repeat(repeat=number_of_repetitions, number=10)
    approach_times_min[approach].append(min(times))
    approach_times_mean[approach].append(mean(times))

(Two plots of the measured times per approach: one using min(), one using mean().)

This can lead to wrong conclusions.

Why is the behaviour of %timeit different from what is stated in the documentation? Can I change a setting to make it report the minimum instead?

  • What it does is find the best result 7 times, each time doing 1,000,000 or so loops. Then it takes the mean and std of those 7 best results. – dankal444 Jul 12 '23 at 09:35
  • If you really want only one such run, just use `-r 1`, as the documentation says. – dankal444 Jul 12 '23 at 09:36
  • ["Whenever you do a statistical experiment (in this case a timing experiment) you want to repeat (or replicate) the experiment in order to be able to quantify uncertainty."](https://stackoverflow.com/a/71474768/4601890) – dankal444 Jul 12 '23 at 09:47
  • I don't want to have just one sample run, because that could lead to wrong conclusions. I want to have multiple runs and have the minimum reported. This is IMO the only sane thing to do. – Sebastian Wozny Jul 12 '23 at 09:49
  • This is not a statistical experiment, @dankal444. When measuring the runtime of code execution you should not take the mean and calculate the std. I'll edit the question to show a better reference for this. – Sebastian Wozny Jul 12 '23 at 09:50

1 Answer


Just do a single repeat with more loops then.
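
For example (the loop count here is just illustrative):

%timeit -r 1 -n 10000000 u is None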

BUT

I advise doing multiple repeats, because then you have a chance to notice that something is wrong.

If you have something running in the background, some process that will mess up your results, it is unlikely that it will mess them up consistently. Each repeat should give a different result, so the std of those runs will increase significantly. If the std is small compared to the mean, you can quite safely assume (though never be sure) that your results are OK.

In other words, it is easy to make a mistake, but hard to make exactly the same mistake 7 or so times.
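
If you want to apply that sanity check yourself, and also get at the minimum the question asks about, `%timeit -o` returns a TimeitResult object whose attributes expose the individual repeats. A minimal sketch (no values shown, since they depend on your machine):

result = %timeit -o u is None   # -o returns a TimeitResult in addition to printing the summary

result.best      # minimum per-loop time across the repeats, in seconds
result.average   # mean per-loop time (what the printed summary reports)
result.stdev     # std. dev. of the per-loop times across the repeats
result.all_runs  # raw total time of each repeat; divide by result.loops for per-loop times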
