
Can I get some help on how to find the max value of a particular column between two dates, based on a condition from another column?

I have a df like the one below, and I need to find the max value of the ['high'] column between two 'act' values in the ['mark'] column, within the same ['symbol'], and store the result in a new column.

i.e. find the max of high for APPLE between 04/03/2021 and 09/03/2021, as both these dates have "act" in the mark column. [There are more 'act' markers in the column, but due to space constraints I share a short version here.]

Similarly for ORANGE between 04/03/2021 and 10/03/2021.

It should not do this calculation for the "act" marker for APPLE on 09/03/2021, as there is no further 'act' for APPLE after that.

Data:

date symbol open high low close mark
03/03/2021 APPLE 732 754.95 723.4 729.85
04/03/2021 APPLE 733.25 765.7 715.85 752.45 act
05/03/2021 APPLE 752.45 761 730.5 748.95
08/03/2021 APPLE 762.7 767.8 744.2 748.4
09/03/2021 APPLE 755.55 759.4 738.65 750.75 act
10/03/2021 APPLE 757.5 753.1 743 745.35
12/03/2021 APPLE 743 752.1 723 728.15
15/03/2021 APPLE 727.8 727.8 706.05 719.05
03/03/2021 ORANGE 2406 2417.7 2375.8 2402.1
04/03/2021 ORANGE 2380 2435 2350 2417.1 act
05/03/2021 ORANGE 2399 2423.9 2377.1 2387.1
08/03/2021 ORANGE 2383 2413.5 2360.05 2382.7
09/03/2021 ORANGE 2400 2444 2396.15 2422.7
10/03/2021 ORANGE 2446 2446 2415.55 2431.95 act
12/03/2021 ORANGE 2442.8 2464.65 2397 2401.35
15/03/2021 ORANGE 2402.55 2427.55 2343.05 2355
  • What answers are you looking for here, is it right to say you'd expect high for APPLE to be 767.8 on the 8th March? Where do you want this number in the new column in this dataframe, or do you want an extract that returns just the values (2 in this case) that you want as a separate table? – Thomas Kimber Mar 30 '21 at 08:47
  • Thank you for asking. I expect the high for APPLE, 767.8, to appear in a new column on the row of 04/03/2021; the idea being that between the two "act"-marked dates the highest value it reached was 767.8. – Dhriti Mar 30 '21 at 08:57
  • OK, it's an interesting problem - what things have you tried so far, and what sorts of volumes are you looking at here? A solution for a few thousand rows might look different to a solution for many millions of rows. – Thomas Kimber Mar 30 '21 at 09:01
  • I tried using the info in this link but it did not take me very far: https://stackoverflow.com/questions/15705630/get-the-rows-which-have-the-max-value-in-groups-using-groupby The size of the data would be 1 to 2 million rows at most. – Dhriti Mar 30 '21 at 09:06

1 Answer


OK, I've taken a crack at this - first I recreated the dataframe:

import pandas as pd

data={("03/03/2021","APPLE",732,754.95,723.4,729.85,), 
      ("04/03/2021","APPLE",733.25,765.7,715.85,752.45,"act"), 
      ("05/03/2021","APPLE",752.45,761,730.5,748.95,), 
      ("08/03/2021","APPLE",762.7,767.8,744.2,748.4,), 
      ("09/03/2021","APPLE",755.55,759.4,738.65,750.75,"act"), 
      ("10/03/2021","APPLE",757.5,753.1,743,745.35,), 
      ("12/03/2021","APPLE",743,752.1,723,728.15,), 
      ("15/03/2021","APPLE",727.8,727.8,706.05,719.05,), 
      ("03/03/2021","ORANGE",2406,2417.7,2375.8,2402.1,), 
      ("04/03/2021","ORANGE",2380,2435,2350,2417.1,"act"), 
      ("05/03/2021","ORANGE",2399,2423.9,2377.1,2387.1,), 
      ("08/03/2021","ORANGE",2383,2413.5,2360.05,2382.7,), 
      ("09/03/2021","ORANGE",2400,2444,2396.15,2422.7,), 
      ("10/03/2021","ORANGE",2446,2446,2415.55,2431.95,"act"), 
      ("12/03/2021","ORANGE",2442.8,2464.65,2397,2401.35,), 
      ("15/03/2021","ORANGE",2402.55,2427.55,2343.05,2355,)}

df = (pd.DataFrame(data,
                   columns=("date","symbol","open","high","low","close","mark"))
      .sort_values(by=["symbol", "date"])
      .fillna("")
      .reset_index(drop=True))

I figure that what you want is a simple max on a group-by. The tricky part is manipulating your data so it conforms with what group-by expects - that is, a field on which to group.

def block_diff(series, trigger, start_stop=False):
    """Return a boolean Series that is True from one trigger value up to
    (and including) the next one. With start_stop=True the block closes at
    the second trigger; otherwise each trigger simply flips the state."""
    toggle = False
    rs = list()
    for i, v in series.items():  # .iteritems() was removed in pandas 2.0
        if v == trigger:
            if start_stop and toggle:
                # closing trigger: include this row, then switch off
                rs.append(toggle)
                toggle = not toggle
            elif start_stop and not toggle:
                # opening trigger: switch on and include this row
                toggle = not toggle
                rs.append(toggle)
            elif not start_stop:
                toggle = not toggle
                rs.append(toggle)
        else:
            rs.append(toggle)
    return pd.Series(rs)

So that's the function defined - the idea is to block out the regions that will feature in the group-by. It accepts a series, a trigger value to match against, and a start_stop flag to fine-tune the behaviour.
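As a quick sanity check (a toy series of my own, not part of the question's data), rows from one trigger up to and including the next come back True:

demo = pd.Series(["", "act", "", "act", ""])
print(block_diff(demo, "act", start_stop=True).tolist())
# [False, True, True, True, False]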

If I apply that to the dataframe, using the returned True/False values as an index to populate a copy of the grouping variable, and store the result in a new field called act_block, I get a unique grouping field that also works as a start-stop filter. At the same time, I create an additional column called act_sequence, which we'll use later to identify the starting row of each sub-group.

df['act_block'] = df[block_diff(df['mark'], "act", True)]['symbol']
df['act_sequence'] = df.groupby("act_block").cumcount()
df

    date        symbol  open    high    low     close   mark    act_block   act_sequence
0   03/03/2021  APPLE   732.00  754.95  723.40  729.85          NaN         0
1   04/03/2021  APPLE   733.25  765.70  715.85  752.45  act     APPLE       0
2   05/03/2021  APPLE   752.45  761.00  730.50  748.95          APPLE       1
3   08/03/2021  APPLE   762.70  767.80  744.20  748.40          APPLE       2
4   09/03/2021  APPLE   755.55  759.40  738.65  750.75  act     APPLE       3
5   10/03/2021  APPLE   757.50  753.10  743.00  745.35          NaN         1
6   12/03/2021  APPLE   743.00  752.10  723.00  728.15          NaN         2
7   15/03/2021  APPLE   727.80  727.80  706.05  719.05          NaN         3
8   03/03/2021  ORANGE  2406.00 2417.70 2375.80 2402.10         NaN         4
9   04/03/2021  ORANGE  2380.00 2435.00 2350.00 2417.10 act     ORANGE      0
10  05/03/2021  ORANGE  2399.00 2423.90 2377.10 2387.10         ORANGE      1
11  08/03/2021  ORANGE  2383.00 2413.50 2360.05 2382.70         ORANGE      2
12  09/03/2021  ORANGE  2400.00 2444.00 2396.15 2422.70         ORANGE      3
13  10/03/2021  ORANGE  2446.00 2446.00 2415.55 2431.95 act     ORANGE      4
14  12/03/2021  ORANGE  2442.80 2464.65 2397.00 2401.35         NaN         5
15  15/03/2021  ORANGE  2402.55 2427.55 2343.05 2355.00         NaN         6

Now we can do a simple groupby on act_block, saving the results into a series called max_groups:

max_groups = df.groupby("act_block")["high"].max()


act_block
APPLE      767.8
ORANGE    2446.0
Name: high, dtype: float64

Take this series and merge it with the original dataframe. If we apply a filter during that step, the max_vals object inherits the original dataframe's index, which lets us use pd.concat to selectively join the two objects together and produce the intended output.

# The merged result keeps df's row order and 0..n-1 index, so filtering it with
# df['act_sequence']==0 keeps only the rows that start a block, and those rows
# retain df's index labels for the concat below.
max_vals = df.merge(max_groups, left_on=["act_block"], right_on="act_block",
                    how="left")[df['act_sequence']==0].fillna("")['high_y']
max_vals.name = "max_val"
new_df = pd.concat([df, max_vals], axis=1).fillna("")
new_df = new_df[["date","symbol","open","high","low","close","mark","max_val"]]

new_df
date symbol open high low close mark max_val
0 03/03/2021 APPLE 732.00 754.95 723.40 729.85
1 04/03/2021 APPLE 733.25 765.70 715.85 752.45 act 767.8
2 05/03/2021 APPLE 752.45 761.00 730.50 748.95
3 08/03/2021 APPLE 762.70 767.80 744.20 748.40
4 09/03/2021 APPLE 755.55 759.40 738.65 750.75 act
5 10/03/2021 APPLE 757.50 753.10 743.00 745.35
6 12/03/2021 APPLE 743.00 752.10 723.00 728.15
7 15/03/2021 APPLE 727.80 727.80 706.05 719.05
8 03/03/2021 ORANGE 2406.00 2417.70 2375.80 2402.10
9 04/03/2021 ORANGE 2380.00 2435.00 2350.00 2417.10 act 2446
10 05/03/2021 ORANGE 2399.00 2423.90 2377.10 2387.10
11 08/03/2021 ORANGE 2383.00 2413.50 2360.05 2382.70
12 09/03/2021 ORANGE 2400.00 2444.00 2396.15 2422.70
13 10/03/2021 ORANGE 2446.00 2446.00 2415.55 2431.95 act
14 12/03/2021 ORANGE 2442.80 2464.65 2397.00 2401.35
15 15/03/2021 ORANGE 2402.55 2427.55 2343.05 2355.00
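
As a side note, since the comments mention 1-2 million rows: below is a rough vectorized sketch of the same idea (my own variant, not part of the approach above) that avoids the Python-level loop in block_diff. It assumes df is sorted by symbol and date with a fresh 0..n-1 index, exactly as built earlier, and max_val_vec is just an illustrative column name.

import numpy as np

# Flag the 'act' rows and, per symbol, number the stretch of rows that starts
# at each one (rows before the first 'act' get block 0).
is_act = df["mark"].eq("act")
block = is_act.astype(int).groupby(df["symbol"]).cumsum()

# Max of 'high' within each (symbol, block). The closing 'act' row falls into
# the *next* block, so its own 'high' is folded back in separately below.
block_max = df.groupby([df["symbol"], block])["high"].transform("max")

# For each 'act' row, the 'high' of the next 'act' row in the same symbol
# (NaN when there is no later 'act' - exactly the rows we want to skip).
next_act_high = df.loc[is_act].groupby("symbol")["high"].shift(-1)

# np.maximum propagates NaN, so the last 'act' of each symbol stays empty;
# the assignment aligns on the index, leaving all non-'act' rows as NaN.
df["max_val_vec"] = np.maximum(block_max[is_act], next_act_high)

On the sample data this reproduces 767.8 on the 04/03/2021 APPLE row and 2446 on the 04/03/2021 ORANGE row, with everything else left as NaN.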
  • Firstly, a big thank you for taking the time out for this. Since I am a beginner, I am a bit overwhelmed looking at this, but I am positive I will be able to grasp it with a little time. I do see your output result is just the way I need it to be. Again, can't thank you enough. Kind regards, Dhriti – Dhriti Mar 30 '21 at 11:12
  • You're welcome - I think it's an interesting question because it combines 3 or 4 operations that would normally be relatively straightforward on their own, but this problem requires combining them in an interesting way. As you try it out, take a look at the intermediate dataframes/objects, which should help you understand what's happening at each stage. – Thomas Kimber Mar 30 '21 at 12:43