Issue
I'm trying to use pandas to create a new column that returns, for each id/date row, the maximum value over the next 2 days for the same id.
Below are the input data and the output I expect to get. I came up with a naive solution using iterrows (see below), but I would like to:
- find a more optimized/pythonic way to do it
- find a more generic way, so that the condition can be passed as a function parameter and the same code can perform different MAXIFS-style computations
n.b.1 This question is a generalization of "Perform Excel MAXIFS in Pandas".
n.b.2 It is equivalent to applying this Excel MAXIFS formula to each row: MAXIFS(C:C;A:A;A2;B:B;">"&B2;B:B;"<="&B2+2)
where: A = id; B = date; C = value
Input data
from datetime import datetime

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": ["a"] * 2 + ["b"] * 4 + ["a", "b"] * 2 + ["b"],
    "date": pd.date_range(datetime(2023, 1, 1), periods=11).tolist(),
    "value": [3, 10, 2, 20, 24, 9, 21, 7, 25, 12, 7]
})
df
# id date value
# 0 a 2023-01-01 3
# 1 a 2023-01-02 10
# 2 b 2023-01-03 2
# 3 b 2023-01-04 20
# 4 b 2023-01-05 24
# 5 b 2023-01-06 9
# 6 a 2023-01-07 21
# 7 b 2023-01-08 7
# 8 a 2023-01-09 25
# 9 b 2023-01-10 12
# 10 b 2023-01-11 7
Expected output
expected_col = [10, np.nan, 24, 24, 9, 7, 25, 12, np.nan, 7, np.nan]
df_expected = pd.concat([df, pd.DataFrame({"next_2d_max": expected_col})], axis=1)
df_expected
# id date value next_2d_max
# 0 a 2023-01-01 3 10.0
# 1 a 2023-01-02 10 NaN
# 2 b 2023-01-03 2 24.0
# 3 b 2023-01-04 20 24.0
# 4 b 2023-01-05 24 9.0
# 5 b 2023-01-06 9 7.0
# 6 a 2023-01-07 21 25.0
# 7 b 2023-01-08 7 12.0
# 8 a 2023-01-09 25 NaN
# 9 b 2023-01-10 12 7.0
# 10 b 2023-01-11 7 NaN
Naive solution
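I came up with a solution that iterates over df row by row and, for each row, filters the whole DataFrame to find its local max. That is roughly quadratic in the number of rows, so I assume it will not scale to a DataFrame with millions of rows. Can you help me figure out a better solution?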
import pandas as pd
from datetime import timedelta

def get_local_max(df, row):
    local_max = df[
        (df["id"] == row["id"])
        & (df["date"] > row["date"])
        & (df["date"] <= row["date"] + timedelta(days=2))
    ]["value"].max()
    return local_max

def get_all_max(df):
    for index, row in df.iterrows():
        yield get_local_max(df, row)

df["next_2d_max"] = pd.Series([local_max for local_max in get_all_max(df)])
pd.testing.assert_frame_equal(df, df_expected)
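One direction I have been considering, as a rough sketch only: replace the row loop with a self-join on id, apply the condition to the joined pairs, and take a grouped max. The helper name `maxifs`, its parameters (`by`, `value`, `condition`), the `_row` column and the `_other` suffix are all my own hypothetical choices, not an existing pandas API. The condition is a plain function of the joined frame, which is how I imagine making the approach generic.

```python
import numpy as np
import pandas as pd


def maxifs(df, by, value, condition):
    """Hypothetical helper (not a pandas API): for each row, return the max of
    `value` over rows with the same `by` key for which `condition` holds."""
    n = len(df)
    # Tag each row with its position so results can be mapped back later.
    left = df.assign(_row=np.arange(n))
    # Candidate rows get an "_other" suffix to avoid column name clashes.
    right = df.add_suffix("_other")
    # Self-join on the key: one line per (row, candidate row) pair.
    pairs = left.merge(right, left_on=by, right_on=f"{by}_other")
    # `condition` receives the joined frame and returns a boolean mask.
    mask = condition(pairs)
    best = pairs[mask].groupby("_row")[f"{value}_other"].max()
    # Rows without any matching candidate come back as NaN.
    return pd.Series(best.reindex(np.arange(n)).to_numpy(), index=df.index)


def next_2_days(pairs):
    # Same window as the Excel formula: date < other date <= date + 2 days.
    return (pairs["date_other"] > pairs["date"]) & (
        pairs["date_other"] <= pairs["date"] + pd.Timedelta(days=2)
    )


df = pd.DataFrame({
    "id": ["a"] * 2 + ["b"] * 4 + ["a", "b"] * 2 + ["b"],
    "date": pd.date_range("2023-01-01", periods=11),
    "value": [3, 10, 2, 20, 24, 9, 21, 7, 25, 12, 7],
})
result = maxifs(df, by="id", value="value", condition=next_2_days)
df["next_2d_max"] = result
```

The caveat is that the intermediate `pairs` frame has one line per pair of rows sharing an id, so memory grows quadratically with the size of each id group. It avoids iterrows, but I am not sure it is the right trade-off for very large groups.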