1

Consider the following code:

import numpy as np
import pandas as pd
a = pd.DataFrame({'case': np.arange(10000) % 100,
                  'x': np.random.rand(10000) > 0.5})
%timeit any(a.x)
%timeit a.x.max()
%timeit a.groupby('case').x.transform(any)
%timeit a.groupby('case').x.transform(max)

13.2 µs ± 179 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
195 µs ± 811 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
25.9 ms ± 555 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
1.43 ms ± 13.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

b = pd.DataFrame({'x': np.random.rand(100) > 0.5})
%timeit any(b.x)
%timeit b.x.max()

13.1 µs ± 205 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
81.5 µs ± 1.81 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

We see that "any" works faster than "max" on a boolean pandas.Series of size 100 and 10000, but when we try to groupby and transform data in groups of 100, suddenly "max" is a lot faster than "any". Why?

Viktoriya Malyasova
  • In the transform case, try passing the function name as a string: `x.transform('any')`. Otherwise it will use the Python function. And as far as I know, the vectorized ones do not short-circuit. – ayhan Apr 30 '19 at 14:15
  • It's called *"Short Circuit Evaluation"*, see https://en.wikipedia.org/wiki/Short-circuit_evaluation – Mark Setchell Jun 20 '19 at 19:50

2 Answers

3

Because any is evaluated lazily: the any function stops at the first True element it encounters.

max, however, cannot do that, because it has to inspect every element in the sequence to be sure it has not missed a greater one.

That is why max always inspects all elements, while any only inspects the elements up to and including the first True.
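The difference can be made visible with a small sketch that counts how many elements each built-in actually consumes. The `tracked` helper and the sample data are made up for illustration; they are not part of the question.

```python
def tracked(values, seen):
    """Yield each value, recording it in `seen` as it is consumed."""
    for v in values:
        seen.append(v)
        yield v

data = [False, False, True, False, False]

# any() stops pulling from the generator once it sees a True.
seen_by_any = []
any_result = any(tracked(data, seen_by_any))

# max() has to exhaust the generator before it can answer.
seen_by_max = []
max_result = max(tracked(data, seen_by_max))

print(any_result, len(seen_by_any))  # True 3
print(max_result, len(seen_by_max))  # True 5
```

Here any consumes only the first three elements, while max consumes all five.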

The cases where max is faster are probably those involving type coercion: NumPy stores all values in its own types and formats, so its vectorized mathematical operations may be faster than Python's any.

0

As said in the comments, the Python any function has a short-circuit mechanism, whereas np.any does not. See here.
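For the grouped case, the comment's suggestion of passing the name as a string can be sketched as follows. This assumes a pandas version in which `transform` accepts the `'any'` alias; the tiny frame is made-up data, and no timing claim is implied.

```python
import pandas as pd

# Hypothetical toy data: group 1 contains no True at all.
df = pd.DataFrame({'case': [0, 0, 1, 1, 2, 2],
                   'x': [False, True, False, False, True, True]})

# The string alias lets pandas pick its own grouped implementation
# instead of calling the Python built-in once per group.
flags = df.groupby('case')['x'].transform('any')

print(flags.tolist())  # [True, True, False, False, True, True]
```

Each row receives the "any" result of its own group, which is the same shape of output that `transform(any)` produces in the question.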

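But True in a.x is even faster (note, however, that `in` on a Series tests membership in the index labels, not in the values, so it is not checking the same thing as any):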

%timeit any(a.x)
53.6 µs ± 543 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit True in (a.x)
3.39 µs ± 31.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
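One caveat when reading that timing: the `in` operator on a pandas Series looks the key up among the index labels rather than the values. A minimal sketch with made-up labels shows the difference:

```python
import pandas as pd

# Hypothetical series: boolean values, integer labels 10, 20, 30.
s = pd.Series([True, True, False], index=[10, 20, 30])

label_hit = 10 in s                # True: 10 is an index label
value_as_label = True in s         # False: True is a value, not a label
value_hit = True in s.to_numpy()   # True: membership test on the values
method_any = bool(s.any())         # True: pandas' own reduction

print(label_hit, value_as_label, value_hit, method_any)
```

To test the values themselves, use `s.any()` or check membership on `s.to_numpy()`.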
B. M.