- Pandas 1.0.5
- Python 3.8.0
- NumPy 1.19.0

This code behaves strangely:

    import pandas as pd

    def calc(row):
        print(f"Row: {row.to_list()}")
        result = pd.Series({
            "sum1": row.col1 + row.col2,
            "sum2": row.col2 + row.col3,
            "sum3": row.col1 + row.col3,
        })
        return result

    df = pd.DataFrame({"col1": [1, 2, 3],
                       "col2": [4, 5, 6],
                       "col3": [7, 8, 9]})
    df[["sum12", "sum23", "sum13"]] = df.apply(lambda row: calc(row), axis=1)
    print(df)

It prints:

    Row: [1, 4, 7]
    Row: [1, 4, 7]
    Row: [2, 5, 8]
    Row: [3, 6, 9]
       col1  col2  col3  sum12  sum23  sum13
    0     1     4     7      5     11      8
    1     2     5     8      7     13     10
    2     3     6     9      9     15     12

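To check that the repeated `Row:` line is a genuine second call rather than a printing artifact, here is a minimal sketch that counts calls per index label. The names `calls` and `calc_counted` are made up for this illustration, and the counts may differ between pandas versions, so no expected output is given.

```python
from collections import Counter

import pandas as pd

calls = Counter()  # hypothetical helper: index label -> number of calls


def calc_counted(row):
    # row.name is the index label of the row being processed
    calls[row.name] += 1
    return pd.Series({
        "sum1": row.col1 + row.col2,
        "sum2": row.col2 + row.col3,
        "sum3": row.col1 + row.col3,
    })


df = pd.DataFrame({"col1": [1, 2, 3],
                   "col2": [4, 5, 6],
                   "col3": [7, 8, 9]})
sums = df.apply(calc_counted, axis=1)

# Show how often each row was passed to the function
for label, count in sorted(calls.items()):
    print(f"row {label}: called {count} time(s)")
```

A count above 1 for label 0 would confirm that `apply` itself invokes the function twice on the first row.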
First question:
Why is the first row processed twice?
Second question, possibly linked to the first:
In my real code, processing the first row takes 0.15 seconds (measured with time.process_time()), while the following rows take between 0.53 and 0.60 seconds. The first row is processed twice: the first time takes 0.15 seconds, the second time 0.55 seconds.
What could be the reason, given that the data are uniform, only NumPy is used in calc(), and there are no conditionals or data filters involved?
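For reference, the per-row timings were collected roughly along these lines. This is a simplified sketch built on the toy example, not my real code: `calc_timed` and `call_times` are names invented here, and the toy `calc()` is far too cheap to reproduce the 0.15 s versus 0.55 s gap.

```python
import time

import pandas as pd

call_times = []  # hypothetical log: (index label, CPU seconds) per call


def calc(row):
    return pd.Series({
        "sum1": row.col1 + row.col2,
        "sum2": row.col2 + row.col3,
        "sum3": row.col1 + row.col3,
    })


def calc_timed(row):
    # time.process_time() measures CPU time of the current process
    start = time.process_time()
    result = calc(row)
    call_times.append((row.name, time.process_time() - start))
    return result


df = pd.DataFrame({"col1": [1, 2, 3],
                   "col2": [4, 5, 6],
                   "col3": [7, 8, 9]})
sums = df.apply(calc_timed, axis=1)

# One line per call, in call order, so a repeated first row is visible
for label, elapsed in call_times:
    print(f"row {label}: {elapsed:.6f} s")
```

Logging every call in order, rather than one value per row, is what makes the two separate timings for the first row visible.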