While trying to answer another question, I noticed that any(), when applied within groupby(), is equally slow regardless of the content of the DataFrame. For example, it takes the same time to inspect a column of Trues as a column of Falses, although a short-circuiting any() could stop at the first True in each group. The same is true of all(). This observation contradicts the assumption that any() short-circuits.

import pandas as pd
import numpy as np
from timeit import timeit

# One million rows split at random into two groups; every value is True.
df = pd.DataFrame({'id': np.random.randint(0, 2, 1000000), 'data': True})
timeit('df.groupby("id").any()', globals=globals(), number=100)
# 1.0371657210052945

# Same frame, but every value is now False.
df['data'] = False
timeit('df.groupby("id").any()', globals=globals(), number=100)
# 1.0135124520165846
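
For comparison, here is a minimal sketch of what I mean by short-circuiting, using Python's built-in any() and all() rather than anything from pandas. The helper `counting_iter` and the counter lists are hypothetical names introduced only for this illustration; they record how many elements were actually consumed.

```python
def counting_iter(values, counter):
    """Yield values one at a time, recording how many were consumed.

    Hypothetical helper for illustration only; not part of pandas or NumPy.
    """
    for value in values:
        counter[0] += 1
        yield value


n = 1000

# Built-in any() stops at the first truthy element.
consumed_any_true = [0]
any(counting_iter([True] * n, consumed_any_true))

# With no truthy element, any() has to scan everything.
consumed_any_false = [0]
any(counting_iter([False] * n, consumed_any_false))

# Built-in all() stops at the first falsy element.
consumed_all_false = [0]
all(counting_iter([False] * n, consumed_all_false))

print(consumed_any_true[0], consumed_any_false[0], consumed_all_false[0])
```

If groupby().any() behaved this way, I would expect the all-True column to be noticeably faster than the all-False one, which is not what the timings above show.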

Could anyone clarify whether any() and all() short-circuit in pandas?

DYZ
  • `df.groupby("id").agg(np.any)` gives me identical timing to above. So it might be a question about `np.any` as well. – Henry Ecker Jul 17 '21 at 21:17
  • Perhaps duplicate of [Why “numpy.any” has no short-circuit mechanism?](https://stackoverflow.com/q/45771554/15497888) If it is, that explains why `groupby.max` is so fast in the previous question as well... – Henry Ecker Jul 17 '21 at 21:20
