
My pandas DataFrame looks like this:

                          open     high      low     close   volume
timestamp  expiry                                                  
2018-09-10 2018-09-21  2885.25  2888.25  2876.50  2880.250   999262
           2018-12-21  2889.75  2893.00  2881.25  2885.000    15999
2018-09-11 2018-09-21  2871.25  2893.00  2867.25  2889.750   973957
           2018-12-21  2876.00  2897.75  2872.25  2894.500    25031
2018-09-12 2018-09-21  2888.00  2895.25  2879.50  2888.375  1252385
           2018-12-21  2893.00  2900.25  2884.50  2893.375    54971
2018-09-13 2018-09-21  2899.75  2907.00  2896.50  2905.250   978670
           2018-12-21  2905.00  2912.00  2901.75  2910.250   226989
2018-09-14 2018-09-21  2907.00  2909.50  2896.75  2906.250   507802
           2018-12-21  2912.25  2914.50  2902.00  2911.375   703911
2018-09-17 2018-09-21  2904.50  2905.25  2887.25  2891.000   311638
           2018-12-21  2909.50  2910.25  2892.25  2896.000   894660
2018-09-18 2018-09-21  2892.25  2912.50  2892.00  2906.500   206016
           2018-12-21  2897.75  2917.75  2897.25  2911.750   946741
2018-09-19 2018-09-21  2906.00  2913.75  2904.75  2909.875   130272
           2018-12-21  2911.50  2918.75  2910.00  2915.125   828194
2018-09-20 2018-09-21  2922.50  2935.75  2921.25  2934.125   132722
           2018-12-21  2927.75  2940.75  2926.25  2939.375  1063115
2018-09-21 2018-12-21  2945.25  2946.50  2932.00  2933.750  1181406

For each timestamp, I need to select the row with the larger volume. I'm not sure how to use groupby() or possibly drop_duplicates() for this.

Desired output should look like this:

                          open     high      low     close   volume
timestamp  expiry                                                  
2018-09-10 2018-09-21  2885.25  2888.25  2876.50  2880.250   999262
2018-09-11 2018-09-21  2871.25  2893.00  2867.25  2889.750   973957
2018-09-12 2018-09-21  2888.00  2895.25  2879.50  2888.375  1252385
2018-09-13 2018-09-21  2899.75  2907.00  2896.50  2905.250   978670
2018-09-14 2018-12-21  2912.25  2914.50  2902.00  2911.375   703911
2018-09-17 2018-12-21  2909.50  2910.25  2892.25  2896.000   894660
2018-09-18 2018-12-21  2897.75  2917.75  2897.25  2911.750   946741
2018-09-19 2018-12-21  2911.50  2918.75  2910.00  2915.125   828194
2018-09-20 2018-12-21  2927.75  2940.75  2926.25  2939.375  1063115
2018-09-21 2018-12-21  2945.25  2946.50  2932.00  2933.750  1181406

Grateful for your help!

steff

2 Answers


Make sure timestamp is not in the index; use reset_index() if it is.

Sort values on timestamp and volume, with volume in descending order.

Then drop_duplicates() on timestamp, keeping the first (i.e. highest-volume) row per day.

df = df.reset_index()  # move timestamp/expiry out of the MultiIndex into columns
df.sort_values(by=['timestamp', 'volume'], ascending=[True, False], inplace=True)
df.drop_duplicates(subset=['timestamp'], keep='first', inplace=True)
df = df.set_index(['timestamp', 'expiry'])  # restore the original index
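
For comparison, the groupby() route the question mentions can do the same selection in one step; a minimal sketch, assuming the two-level (timestamp, expiry) MultiIndex from the question:

# idxmax() returns, per timestamp, the (timestamp, expiry) label of the max-volume row
df.loc[df.groupby(level='timestamp')['volume'].idxmax()]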
tawab_shakeel

You can try something along these lines:

import pandas as pd

# max volume per timestamp (grouping by expiry as well would make every row its own group)
max_vol = df.reset_index().groupby('timestamp')['volume'].max()
max_vol = max_vol.reset_index()  # Series.reset_index() does not support inplace=True

pd.merge(left=df.reset_index(), right=max_vol, on=['timestamp', 'volume'], how='inner')
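
If you prefer to avoid the merge, an equivalent mask-based approach keeps every row whose volume equals its day's maximum; a minimal sketch against the same MultiIndexed df (like the merge, it keeps both rows if two contracts tie on volume):

# transform('max') broadcasts the per-timestamp max back onto every row
df[df.groupby(level='timestamp')['volume'].transform('max') == df['volume']]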
v_a