
Can I get some help on how to find the max value of a particular column between two dates, based on a condition from another column?

I have a df like the one below, and I need to find the max value of the ['high'] column between two 'act' values in the ['mark'] column, within the same ['symbol'], and store the result in a new column.

i.e. find the max of high for APPLE between 04/03/2021 and 09/03/2021, as both these dates have "act" in the mark column. [There are more 'act' markers in the column, but due to space constraints I share a short version here.]

Similarly for ORANGE between 04/03/2021 and 10/03/2021.

It should not do this calculation for the "act" marker for APPLE on 09/03/2021, as there is no further 'act' for APPLE after that.

Data:

date symbol open high low close mark
03/03/2021 APPLE 732 754.95 723.4 729.85
04/03/2021 APPLE 733.25 765.7 715.85 752.45 act
05/03/2021 APPLE 752.45 761 730.5 748.95
08/03/2021 APPLE 762.7 767.8 744.2 748.4
09/03/2021 APPLE 755.55 759.4 738.65 750.75 act
10/03/2021 APPLE 757.5 753.1 743 745.35
12/03/2021 APPLE 743 752.1 723 728.15
15/03/2021 APPLE 727.8 727.8 706.05 719.05
03/03/2021 ORANGE 2406 2417.7 2375.8 2402.1
04/03/2021 ORANGE 2380 2435 2350 2417.1 act
05/03/2021 ORANGE 2399 2423.9 2377.1 2387.1
08/03/2021 ORANGE 2383 2413.5 2360.05 2382.7
09/03/2021 ORANGE 2400 2444 2396.15 2422.7
10/03/2021 ORANGE 2446 2446 2415.55 2431.95 act
12/03/2021 ORANGE 2442.8 2464.65 2397 2401.35
15/03/2021 ORANGE 2402.55 2427.55 2343.05 2355
  • What answers are you looking for here, is it right to say you'd expect high for APPLE to be 767.8 on the 8th March? Where do you want this number in the new column in this dataframe, or do you want an extract that returns just the values (2 in this case) that you want as a separate table? – Thomas Kimber Mar 30 '21 at 08:47
  • Thank you for asking. I expect the high for APPLE, 767.8, to appear in a new column on the row of 04/03/2021; the idea being that between the two "act"-marked dates the highest value it reached was 767.8. – Dhriti Mar 30 '21 at 08:57
  • OK, it's an interesting problem - what things have you tried so far, and what sorts of volumes are you looking at here? A solution for a few thousand rows might look different to a solution for many millions of rows. – Thomas Kimber Mar 30 '21 at 09:01
  • I tried using the info in this link but it did not take me very far: https://stackoverflow.com/questions/15705630/get-the-rows-which-have-the-max-value-in-groups-using-groupby The size of the data would be 1 to 2 million rows at most. – Dhriti Mar 30 '21 at 09:06

1 Answer


OK, I've taken a crack at this - first I recreated the dataframe:

import pandas as pd

data={("03/03/2021","APPLE",732,754.95,723.4,729.85,), 
      ("04/03/2021","APPLE",733.25,765.7,715.85,752.45,"act"), 
      ("05/03/2021","APPLE",752.45,761,730.5,748.95,), 
      ("08/03/2021","APPLE",762.7,767.8,744.2,748.4,), 
      ("09/03/2021","APPLE",755.55,759.4,738.65,750.75,"act"), 
      ("10/03/2021","APPLE",757.5,753.1,743,745.35,), 
      ("12/03/2021","APPLE",743,752.1,723,728.15,), 
      ("15/03/2021","APPLE",727.8,727.8,706.05,719.05,), 
      ("03/03/2021","ORANGE",2406,2417.7,2375.8,2402.1,), 
      ("04/03/2021","ORANGE",2380,2435,2350,2417.1,"act"), 
      ("05/03/2021","ORANGE",2399,2423.9,2377.1,2387.1,), 
      ("08/03/2021","ORANGE",2383,2413.5,2360.05,2382.7,), 
      ("09/03/2021","ORANGE",2400,2444,2396.15,2422.7,), 
      ("10/03/2021","ORANGE",2446,2446,2415.55,2431.95,"act"), 
      ("12/03/2021","ORANGE",2442.8,2464.65,2397,2401.35,), 
      ("15/03/2021","ORANGE",2402.55,2427.55,2343.05,2355,)}

df = (pd.DataFrame(data,
                   columns=("date","symbol","open","high","low","close","mark"))
      .sort_values(by=["symbol", "date"])
      .fillna("")
      .reset_index(drop=True))

I figure that what you want is a simple max on a group-by. The tricky part is manipulating your data so it conforms with what group-by expects - that is, a field on which to group.

def block_diff(series, trigger, start_stop=False):
    """Return a boolean Series that is True from one trigger value up to
    (and including) the next one. With start_stop=True the block closes at
    the second trigger; otherwise each trigger simply flips the state."""
    toggle = False
    rs = list()
    for i, v in series.items():  # .iteritems() was removed in pandas 2.0
        if v == trigger:
            if start_stop and toggle:
                # closing trigger: include this row, then switch off
                rs.append(toggle)
                toggle = not toggle
            elif start_stop and not toggle:
                # opening trigger: switch on and include this row
                toggle = not toggle
                rs.append(toggle)
            elif not start_stop:
                toggle = not toggle
                rs.append(toggle)
        else:
            rs.append(toggle)
    return pd.Series(rs)

So that's the function defined - the idea is to block out the regions that will feature in the group-by. It accepts a series, a trigger value to match against, and a start_stop flag to fine-tune the behaviour.
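As a quick sanity check (a toy series of my own, not part of the question's data), rows from one trigger up to and including the next come back True:

demo = pd.Series(["", "act", "", "act", ""])
print(block_diff(demo, "act", start_stop=True).tolist())
# [False, True, True, True, False]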

If I apply that to the dataframe, using the returned True/False values as an index to populate a copy of the grouping variable, and store the result in a new field called act_block, I get a unique grouping field that also works as a start-stop filter. At the same time, I create an additional column called act_sequence, which we'll use later to identify the starting row of each sub-group.

df['act_block'] = df[block_diff(df['mark'], "act", True)]['symbol']
df['act_sequence'] = df.groupby("act_block").cumcount()
df

    date        symbol  open    high    low     close   mark    act_block   act_sequence
0   03/03/2021  APPLE   732.00  754.95  723.40  729.85          NaN         0
1   04/03/2021  APPLE   733.25  765.70  715.85  752.45  act     APPLE       0
2   05/03/2021  APPLE   752.45  761.00  730.50  748.95          APPLE       1
3   08/03/2021  APPLE   762.70  767.80  744.20  748.40          APPLE       2
4   09/03/2021  APPLE   755.55  759.40  738.65  750.75  act     APPLE       3
5   10/03/2021  APPLE   757.50  753.10  743.00  745.35          NaN         1
6   12/03/2021  APPLE   743.00  752.10  723.00  728.15          NaN         2
7   15/03/2021  APPLE   727.80  727.80  706.05  719.05          NaN         3
8   03/03/2021  ORANGE  2406.00 2417.70 2375.80 2402.10         NaN         4
9   04/03/2021  ORANGE  2380.00 2435.00 2350.00 2417.10 act     ORANGE      0
10  05/03/2021  ORANGE  2399.00 2423.90 2377.10 2387.10         ORANGE      1
11  08/03/2021  ORANGE  2383.00 2413.50 2360.05 2382.70         ORANGE      2
12  09/03/2021  ORANGE  2400.00 2444.00 2396.15 2422.70         ORANGE      3
13  10/03/2021  ORANGE  2446.00 2446.00 2415.55 2431.95 act     ORANGE      4
14  12/03/2021  ORANGE  2442.80 2464.65 2397.00 2401.35         NaN         5
15  15/03/2021  ORANGE  2402.55 2427.55 2343.05 2355.00         NaN         6

Now we can do a simple groupby on act_block, saving the results into a series called max_groups:

max_groups = df.groupby("act_block")["high"].max()


act_block
APPLE      767.8
ORANGE    2446.0
Name: high, dtype: float64

Take this series and merge it with the original dataframe. If we apply a filter during that step, the max_vals object inherits the original dataframe's index, which lets us use pd.concat to selectively join the two objects together and produce the intended output.

# The merged result keeps df's row order and 0..n-1 index, so filtering it with
# df['act_sequence']==0 keeps only the rows that start a block, and those rows
# retain df's index labels for the concat below.
max_vals = df.merge(max_groups, left_on=["act_block"], right_on="act_block",
                    how="left")[df['act_sequence']==0].fillna("")['high_y']
max_vals.name = "max_val"
new_df = pd.concat([df, max_vals], axis=1).fillna("")
new_df = new_df[["date","symbol","open","high","low","close","mark","max_val"]]

new_df
date symbol open high low close mark max_val
0 03/03/2021 APPLE 732.00 754.95 723.40 729.85
1 04/03/2021 APPLE 733.25 765.70 715.85 752.45 act 767.8
2 05/03/2021 APPLE 752.45 761.00 730.50 748.95
3 08/03/2021 APPLE 762.70 767.80 744.20 748.40
4 09/03/2021 APPLE 755.55 759.40 738.65 750.75 act
5 10/03/2021 APPLE 757.50 753.10 743.00 745.35
6 12/03/2021 APPLE 743.00 752.10 723.00 728.15
7 15/03/2021 APPLE 727.80 727.80 706.05 719.05
8 03/03/2021 ORANGE 2406.00 2417.70 2375.80 2402.10
9 04/03/2021 ORANGE 2380.00 2435.00 2350.00 2417.10 act 2446
10 05/03/2021 ORANGE 2399.00 2423.90 2377.10 2387.10
11 08/03/2021 ORANGE 2383.00 2413.50 2360.05 2382.70
12 09/03/2021 ORANGE 2400.00 2444.00 2396.15 2422.70
13 10/03/2021 ORANGE 2446.00 2446.00 2415.55 2431.95 act
14 12/03/2021 ORANGE 2442.80 2464.65 2397.00 2401.35
15 15/03/2021 ORANGE 2402.55 2427.55 2343.05 2355.00
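
As a side note, since the comments mention 1-2 million rows: below is a rough vectorized sketch of the same idea (my own variant, not part of the approach above) that avoids the Python-level loop in block_diff. It assumes df is sorted by symbol and date with a fresh 0..n-1 index, exactly as built earlier, and max_val_vec is just an illustrative column name.

import numpy as np

# Flag the 'act' rows and, per symbol, number the stretch of rows that starts
# at each one (rows before the first 'act' get block 0).
is_act = df["mark"].eq("act")
block = is_act.astype(int).groupby(df["symbol"]).cumsum()

# Max of 'high' within each (symbol, block). The closing 'act' row falls into
# the *next* block, so its own 'high' is folded back in separately below.
block_max = df.groupby([df["symbol"], block])["high"].transform("max")

# For each 'act' row, the 'high' of the next 'act' row in the same symbol
# (NaN when there is no later 'act' - exactly the rows we want to skip).
next_act_high = df.loc[is_act].groupby("symbol")["high"].shift(-1)

# np.maximum propagates NaN, so the last 'act' of each symbol stays empty;
# the assignment aligns on the index, leaving all non-'act' rows as NaN.
df["max_val_vec"] = np.maximum(block_max[is_act], next_act_high)

On the sample data this reproduces 767.8 on the 04/03/2021 APPLE row and 2446 on the 04/03/2021 ORANGE row, with everything else left as NaN.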
  • Firstly, a big thank you for taking the time out for this. Since I am a beginner, I am a bit overwhelmed looking at this, but I am positive I will be able to grasp it with a little time. I do see your output result is just the way I need it to be. Again, can't thank you enough. Kind regards, Dhriti – Dhriti Mar 30 '21 at 11:12
  • You're welcome - I think it's an interesting question because it combines 3 or 4 operations that would normally be relatively straightforward on their own, but this problem requires combining them in an interesting way. As you try it out, take a look at the intermediate dataframes/objects, which should help you understand what's happening at each stage. – Thomas Kimber Mar 30 '21 at 12:43