
Issue

I'm trying to use Pandas to create a new column holding, for each row, the maximum of the subsequent values with the same id.

Below are the input data and the output I expect to get.

I explored where/groupby/transform as described here: Pandas: find maximum value, when and if conditions. But it doesn't fit my use case, because I need a where condition that depends on each row, not a constant value.

n.b. It is equivalent to the Excel formula MAXIFS: MAXIFS(C:C;A:A;A2;B:B;">"&B2), where A = id, B = date, C = value.

Input data

import numpy as np
import pandas as pd
from datetime import datetime

df = pd.DataFrame({
    "id": ["a"] * 2 + ["b"] * 4 + ["a", "b"] * 2 + ["b"],
    "date": pd.date_range(datetime(2023, 1, 1), periods=11).tolist(),
    "value": [3, 10, 2, 20, 24, 9, 21, 7, 25, 12, 7]
})

#    id       date  value
# 0   a 2023-01-01      3
# 1   a 2023-01-02     10
# 2   b 2023-01-03      2
# 3   b 2023-01-04     20
# 4   b 2023-01-05     24
# 5   b 2023-01-06      9
# 6   a 2023-01-07     21
# 7   b 2023-01-08      7
# 8   a 2023-01-09     25
# 9   b 2023-01-10     12
# 10  b 2023-01-11      7

Expected output

df_expected = pd.concat(
    [df, pd.DataFrame({
        "next_local_max": [25, 25, 24, 24, 12, 12, 25, 12, np.nan, 7, np.nan]
    })],
    axis=1
)

#    id       date  value  next_local_max
# 0   a 2023-01-01      3            25.0
# 1   a 2023-01-02     10            25.0
# 2   b 2023-01-03      2            24.0
# 3   b 2023-01-04     20            24.0
# 4   b 2023-01-05     24            12.0
# 5   b 2023-01-06      9            12.0
# 6   a 2023-01-07     21            25.0
# 7   b 2023-01-08      7            12.0
# 8   a 2023-01-09     25             NaN
# 9   b 2023-01-10     12             7.0
# 10  b 2023-01-11      7             NaN
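For completeness, the MAXIFS formula can be translated literally into a row-by-row apply (quadratic, so too slow for large frames, which is why I'm looking for a vectorized approach; the helper name is just illustrative):

```python
import numpy as np
import pandas as pd
from datetime import datetime

df = pd.DataFrame({
    "id": ["a"] * 2 + ["b"] * 4 + ["a", "b"] * 2 + ["b"],
    "date": pd.date_range(datetime(2023, 1, 1), periods=11).tolist(),
    "value": [3, 10, 2, 20, 24, 9, 21, 7, 25, 12, 7]
})

def next_local_max(row):
    # MAXIFS(C:C; A:A; A2; B:B; ">"&B2): max of 'value' among rows with the
    # same id and a strictly later date; .max() on an empty selection is NaN.
    mask = (df["id"] == row["id"]) & (df["date"] > row["date"])
    return df.loc[mask, "value"].max()

df["next_local_max"] = df.apply(next_local_max, axis=1)
```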

Edit

This question is well answered. I tried to continue and generalize this issue, where more complex max conditions are needed: Perform Excel MAXIFS in Pandas with multiple conditions

3 Answers


You can use np.tril to compute the local maximum:

def fn(x):
    # Reverse the group, let np.tril broadcast the 1-D values into a
    # lower-triangular matrix, and take row-wise maxima: after reversing
    # back, entry i is the max of x[i:]. shift(-1) then excludes row i itself.
    a = np.tril(x[::-1]).max(axis=1)[::-1]
    return pd.Series(a, index=x.index).shift(-1)

df['next_local_max'] = df.groupby('id', group_keys=False)['value'].apply(fn)
print(df)

Prints:

   id       date  value  next_local_max
0   a 2023-01-01      3            25.0
1   a 2023-01-02     10            25.0
2   b 2023-01-03      2            24.0
3   b 2023-01-04     20            24.0
4   b 2023-01-05     24            12.0
5   b 2023-01-06      9            12.0
6   a 2023-01-07     21            25.0
7   b 2023-01-08      7            12.0
8   a 2023-01-09     25             NaN
9   b 2023-01-10     12             7.0
10  b 2023-01-11      7             NaN

Or a shorter version with np.triu, which skips the array reversing:

def fn(x):
    return pd.Series(np.triu(x).max(axis=1), index=x.index).shift(-1)

df['next_local_max'] = df.groupby('id', group_keys=False)['value'].apply(fn)
print(df)
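To see the broadcasting trick in isolation (note the zero padding implicitly assumes non-negative values; a negative value could lose to a padding zero):

```python
import numpy as np
import pandas as pd

x = pd.Series([2, 20, 24, 9, 12, 7])  # the 'b' values, in date order
m = np.triu(x)                        # 1-D input broadcasts to a 6x6 matrix:
                                      # row i zero-fills the columns before i and
                                      # keeps x[i:], so the row-wise max is the
                                      # suffix maximum of x.
suffix_max = m.max(axis=1)            # array([24, 24, 24, 12, 12, 7])
next_max = pd.Series(suffix_max).shift(-1)  # max of strictly later rows only
print(next_max.tolist())              # [24.0, 24.0, 12.0, 12.0, 7.0, nan]
```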
Andrej Kesely

You can use pandas cummax() to calculate the cumulative maximum of a series.

import numpy as np
import pandas as pd
from datetime import datetime

df = pd.DataFrame({
    "id": ["a"] * 2 + ["b"] * 4 + ["a", "b"] * 2 + ["b"],
    "date": pd.date_range(datetime(2023, 1, 1), periods=11).tolist(),
    "value": [3, 10, 2, 20, 24, 9, 21, 7, 25, 12, 7]})

def get_next_max_local(group):
    """Add a "next_local_max" column containing, for each row, the maximum
    'value' among the remaining rows of the same group.

    Details:
        - group['next_local_max'] => assigns the final result to the 'next_local_max' column.
        - group['value'] => gets the 'value' column of the group df.
        - iloc[::-1] => reverses the order of the rows,
            since cummax() needs to operate from the end of the group to the beginning.
        - cummax() => calculates the cumulative maximum of the (reversed) 'value' column,
            i.e. the largest value seen so far, scanning from the end to the current row.
        - shift() => shifts the cumulative maximum one row forward, so each row
            receives the max of the values from the next row onwards.
    """
    group['next_local_max'] = group['value'].iloc[::-1].cummax().shift()

    return group


# Apply the 'get_next_max_local' function to each group selected by 'id'.
# group_keys=False keeps the original index instead of adding the group
# keys ('id') as an extra index level in the result.
df = df.groupby('id', group_keys=False).apply(get_next_max_local)

# The shift() above already leaves the last row of each group as NaN;
# this line just makes that explicit.
df.loc[df.groupby('id').tail(1).index, 'next_local_max'] = np.nan
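To see the mechanics on a single group, here is the same reversed cummax applied to the 'b' values alone:

```python
import pandas as pd

s = pd.Series([2, 20, 24, 9, 12, 7])  # the 'b' values, in date order
rev_cummax = s.iloc[::-1].cummax()    # running max, scanned from the end
next_max = rev_cummax.shift()         # in reversed order, shift() hands each row
                                      # the running max of chronologically later rows
print(next_max.sort_index().tolist()) # [24.0, 24.0, 12.0, 12.0, 7.0, nan]
```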
Myron_Ben4
  • Thx! What if instead returning the next max, I would like to return the next max of the 2 next days? For id=="b" result will be: [24, 24, 9, 7, 12, 7, NaN] – Alexandre_K Jul 11 '23 at 18:11
  • I've tried to find a solution for a while, without success, now I really need to go to sleep! Tomorrow I'll try again...in any case, I'm quite sure it will be difficult to provide an exhaustive answer via comment. – Myron_Ben4 Jul 11 '23 at 21:02
  • Thx for your help, probably better to post it in a new question, this one is answered I guess. I will work tomorrow on examples to better illustrate my previous comment – Alexandre_K Jul 11 '23 at 21:56
  • I found a naive solution to this issue, but I assume there is a better way to do it. I posted here https://stackoverflow.com/questions/76668069/perform-excel-maxifs-in-pandas-with-multiple-condition ps. I'm new on Stack Overflow, tell me if creating a new question related to another one is the correct way to interact, or if it is preferable to edit this one. – Alexandre_K Jul 12 '23 at 07:29
  • Sure, you did the right thing to create a new question, so anyone can help you! – Myron_Ben4 Jul 12 '23 at 09:29

Try this:

df.assign(
    next_local_max=df.iloc[::-1]
    .groupby('id')['value']
    .transform(lambda x: x.cummax().shift())
)

Output:

   id       date  value  next_local_max
0   a 2023-01-01      3            25.0
1   a 2023-01-02     10            25.0
2   b 2023-01-03      2            24.0
3   b 2023-01-04     20            24.0
4   b 2023-01-05     24            12.0
5   b 2023-01-06      9            12.0
6   a 2023-01-07     21            25.0
7   b 2023-01-08      7            12.0
8   a 2023-01-09     25             NaN
9   b 2023-01-10     12             7.0
10  b 2023-01-11      7             NaN
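This relies on groupby preserving the within-group row order of the reversed frame, so each group's cummax runs from the latest date backwards. A tiny sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"id": ["a", "b", "a", "b"], "value": [1, 5, 3, 2]})
# Reversing the whole frame first means groupby sees each group's rows
# from last to first; cummax().shift() then yields, per original row,
# the max of the strictly later rows of the same id.
rev = df.iloc[::-1].groupby("id")["value"].transform(lambda x: x.cummax().shift())
print(rev.sort_index().tolist())  # [3.0, 2.0, nan, nan]
```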
rhug123