  • Pandas 1.0.5
  • Python 3.8.0
  • Numpy 1.19.0

This code behaves strangely:

import pandas as pd

def calc(row):
    print(f"Row: {row.to_list()}")
    result = pd.Series({
        "sum1": row.col1 + row.col2,
        "sum2": row.col2 + row.col3,
        "sum3": row.col1 + row.col3,
    })
    return result
    
df = pd.DataFrame({"col1":[1,2,3], 
                   "col2":[4,5,6], 
                   "col3":[7,8,9]})

df[["sum12", "sum23", "sum13"]] = df.apply(lambda row: calc(row), axis=1)
print(df)

It returns

Row: [1, 4, 7]
Row: [1, 4, 7]
Row: [2, 5, 8]
Row: [3, 6, 9]

    col1    col2    col3    sum12   sum23   sum13
0    1       4       7       5       11      8
1    2       5       8       7       13     10
2    3       6       9       9       15     12

First question:

Why is the first row processed twice?

Second question possibly linked to the first:

In my real code, processing the first row takes 0.15 seconds (measured with time.process_time()), while the following rows take between 0.53 and 0.60 seconds each. The first row is processed twice: the first pass takes 0.15 seconds, the second 0.55 seconds.

What could be the reason? The data are uniform, only numpy is used in calc(), and there are no conditionals or data filters involved.

Alex Poca
  • First (and most obvious) question, what is your pandas version? [This is a known issue and has been fixed in pandas 0.25](https://stackoverflow.com/a/56215416/4909087). – cs95 Jul 14 '20 at 09:55
  • Added versions. I am working with the most recent ones. It looks like the bug has not been solved completely, or it is unrelated to `group`. Thank you for pointing it out. – Alex Poca Jul 14 '20 at 09:59
  • Oops, just fact checking myself: that link was for groupby.apply only. – cs95 Jul 14 '20 at 10:20

1 Answer


This is a known issue with both GroupBy.apply (pandas < 0.25) and df.apply (pandas < 1.1). The first group (or, for df.apply, the first row) is evaluated twice because apply wants to know whether it can "optimize" the calculation (sometimes this is possible if apply receives a numpy or cythonized function).
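To check whether an installed version is affected, a minimal sketch like the one below records which rows actually reach the function. The `calls` list and the `extra_calls` counter are hypothetical helpers added for illustration, not part of the question's code; the rest mirrors the original example.

```python
import pandas as pd

calls = []  # hypothetical log of the index labels calc() has been called with


def calc(row):
    # row.name is the index label of the row currently being processed
    calls.append(row.name)
    return pd.Series({
        "sum1": row.col1 + row.col2,
        "sum2": row.col2 + row.col3,
        "sum3": row.col1 + row.col3,
    })


df = pd.DataFrame({"col1": [1, 2, 3],
                   "col2": [4, 5, 6],
                   "col3": [7, 8, 9]})

result = df.apply(calc, axis=1)

# Any call beyond one per row is an extra evaluation made by apply itself.
extra_calls = len(calls) - len(df)
print(pd.__version__, calls, extra_calls)
```

If `extra_calls` is greater than zero, a function with side effects (printing, logging, appending to a list) will run those side effects more often than there are rows, so it is safer to keep such functions free of side effects.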

With pandas 0.25, this behavior was fixed for GroupBy.apply. See here. Now with pandas 1.1, the same behavior will be fixed for df.apply.

Once 1.1 is out, you'll be able to upgrade, and then you will see the first row evaluated only once:

pd.__version__
# '1.1.0.dev0+2004.g8d10bfb6f'

df[["sum12", "sum23", "sum13"]] = df.apply(lambda row: calc(row), axis=1)
print(df)
Row: [1, 4, 7]
Row: [2, 5, 8]
Row: [3, 6, 9]
   col1  col2  col3  sum12  sum23  sum13
0     1     4     7      5     11      8
1     2     5     8      7     13     10
2     3     6     9      9     15     12
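If upgrading is not an option, one possible workaround is to skip apply entirely. The sketch below assumes the per-row function only does element-wise arithmetic, as in the toy example from the question; the real calc() may not reduce to this form.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3],
                   "col2": [4, 5, 6],
                   "col3": [7, 8, 9]})

# Column-wise arithmetic operates on whole columns at once, so no Python
# function is called per row and nothing can be evaluated twice.
df["sum12"] = df["col1"] + df["col2"]
df["sum23"] = df["col2"] + df["col3"]
df["sum13"] = df["col1"] + df["col3"]

print(df)
```

This produces the same three columns as the apply version for this data, and it is usually faster because the work is done on whole columns rather than row by row.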
cs95
  • Thank you. It sounds reasonable. But why do the following executions take LONGER, then? I used a single execution to time the calculations, and then saw that the real loop takes 5 times longer :-( – Alex Poca Jul 14 '20 at 10:02
  • @AlexPoca I'm not a pandas dev, so your guess is as good as mine. It may have something to do with pandas testing for possible fastpaths in the code. It could also be linked to caching, if the second time is shorter than the first. Etc. – cs95 Jul 14 '20 at 10:04