
I have a dataframe like this:

                      close
formated                                                               
2017-01-03 09:30:00   29.9713
2017-01-03 09:31:00   29.0622
2017-01-03 09:32:00   29.0750
2017-01-03 09:33:00   29.0276
2017-01-03 09:34:00   29.0375
...                       ...
2022-08-19 09:30:00  173.5500  
2022-08-19 09:31:00  173.4494
2022-08-19 09:32:00  173.3400
2022-08-19 09:33:00  173.3900
2022-08-19 09:34:00  173.2600

The df contains a price value for every minute of the day starting from 9:30 through 16:00. I am looking to get the maximum possible profit for each minute of the day.

I am currently doing something like this:

df['Profit'] = (df.groupby(pd.Grouper(freq='D'))['close'].transform('max') - df['close']).div(df['close'])

This gives me, for each row, the profit as a percentage relative to the highest value of the day.
However, this approach is flawed: it also computes a profit for timestamps that come after the highest value of the day has already been reached.
I don't want the maximum over the whole day, but rather, per day, the maximum over only the timestamps that come after the row currently being looked at.

Profit is defined as the difference between the value at that minute, and the maximum value across all following minutes in the same day.

Desired Output:

                      close     Profit abs.  Profit perc.
formated                                                               
2017-01-03 09:30:00   29.9713   0.0          0.0
2017-01-03 09:31:00   29.0622   0.0128       0.0004404
2017-01-03 09:32:00   29.0750   0.0          0.0
2017-01-03 09:33:00   29.0276   0.0099       0.0003410
2017-01-03 09:34:00   29.0375   0.0          0.0   
...                       ...      ...             ...

Please no solutions that iterate over the dataframe in a for loop as this is extremely slow.
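
To make the target concrete, the desired numbers above can be reproduced on the five sample rows with a small brute-force check (illustration only, since it iterates, which is exactly what I want to avoid on the full dataframe; note that the maximum is taken over the current minute and all later minutes, which is what produces the 0.0 rows at local highs):

import pandas as pd

# The five sample rows from above, hard-coded for illustration.
closes = [29.9713, 29.0622, 29.0750, 29.0276, 29.0375]
idx = pd.date_range("2017-01-03 09:30", periods=5, freq="min")
sample = pd.DataFrame({"close": closes}, index=idx)

# For each minute, take the maximum over that minute and all later minutes
# of the day, then compute the profit relative to that maximum.
suffix_max = [max(closes[i:]) for i in range(len(closes))]
sample["Profit abs."] = [m - c for m, c in zip(suffix_max, closes)]
sample["Profit perc."] = sample["Profit abs."] / sample["close"]
print(sample)  # matches the desired output above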

PWebber
  • An "extremely slow" solution is clearly superior to no solution at all. Perhaps you should convert these columns to Python lists, where iteration is easier. That's the kind of solution you're going to need. – Tim Roberts Oct 06 '22 at 20:49
  • for a given day, is profit for a given minute defined as the difference between the value at that minute, and the maximum across all following minutes ? – dermen Oct 06 '22 at 20:57
  • I was afraid that this is the answer I will get, because that's what I am doing now: Iterating in a for loop. It takes multiple hours to iterate the complete dataframe this way. I even split the dataframe into daily chunks so I can use multiprocessing. I was hoping that there is some "pandas magic" that I am unaware of. – PWebber Oct 06 '22 at 20:58
  • @dermen Exactly. – PWebber Oct 06 '22 at 20:58

1 Answer


You can use pandas' cummax function to compute the cumulative max over each day. However, you will need to apply it in reverse.

If we do this on an hourly basis, for the sake of explanation:

>>> example_df
date              price
2020-01-01 00:00  1
2020-01-01 01:00  2
2020-01-01 03:00  1
2020-01-01 04:00  7
2020-01-01 05:00  5
2020-01-01 06:00  2
2020-01-01 07:00  4

>>> example_df.reverse_cummax()
date              price  reverse_cummax
2020-01-01 00:00  1      7
2020-01-01 01:00  2      7
2020-01-01 03:00  1      7
2020-01-01 04:00  7      7
2020-01-01 05:00  5      5
2020-01-01 06:00  2      4
2020-01-01 07:00  4      4

This is what the output of such a reverse_cummax function would look like.
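
reverse_cummax is not an actual pandas method, so as a runnable sketch of the same single-day idea, the reversal can be spelled out explicitly:

import pandas as pd

# Rebuild the hourly example above.
example_df = pd.DataFrame(
    {"price": [1, 2, 1, 7, 5, 2, 4]},
    index=pd.to_datetime(
        ["2020-01-01 00:00", "2020-01-01 01:00", "2020-01-01 03:00",
         "2020-01-01 04:00", "2020-01-01 05:00", "2020-01-01 06:00",
         "2020-01-01 07:00"]
    ),
)

# Reverse the rows, take the running max, reverse back: each row now holds
# the maximum of itself and every later row.
example_df["reverse_cummax"] = example_df["price"][::-1].cummax()[::-1]
print(example_df)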

We can't use pandas' cummax in reverse directly, and there is no built-in reversed version of it, short of reversing the rows ourselves. So for the real, multi-day dataframe we create a small function that, within each "grouped by" day, reverses the column we are interested in, applies cummax, and reverses the result back once we are done:

def reverse_cummax(df_day):
    return df_day.loc[::-1, "close"].cummax()[::-1]

# The groupby + apply produces a MultiIndex (day, timestamp), but we want our
# initial index back, so we need to drop the outer level
df["reverse_cummax"] = df.groupby(pd.Grouper(freq='D')).apply(reverse_cummax).droplevel(0)

Then you can get the profit by subtracting each close from the daily maximum still to come (reverse_cummax):

df["profit abs"] = df["reverse_cummax"] - df["close"]
df["profit percentage"] = df["profit abs"]/df["close"]

Edited to add the improved solution as suggested by dermen

Florent Monin
    Nice idea. I think you can follow this [recipe](https://stackoverflow.com/a/37872823/2077270) to get the reverse cummax in a single go: `df_day['reverse_cummax'] = df_day.loc[::-1, "close"].cummax()[::-1]`. – dermen Oct 07 '22 at 00:08
  • Thank you! Since we want to apply the cummax on the `DataFrameGroupBy`, I don't think that works: `df.groupby(pd.Grouper(freq='D')).loc[:: -1, "close"].cummax()` returns an `AttributeError: 'DataFrameGroupBy' object has no attribute 'loc'`. And reversing the dataframe before the `groupby` is a no go either, because of the DateTimeIndex, from what I've tried. – Florent Monin Oct 07 '22 at 00:26
    One can define a function `func` with a single argument `def func(df_day)`. Have it perform the operation `df_day['reverse_cummax'] = df_day.loc[::-1, "close"].cummax()[::-1]` and then have it return `df_day`. Then use the groupby [apply](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.apply.html) method : `gb.groupby(pd.Grouper(freq='D')).apply(func)` . – dermen Oct 07 '22 at 01:12
    Oh right, yes, that does work, thank you for the tip! Will edit my post with that improved version – Florent Monin Oct 07 '22 at 06:33
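
Following up on the comment thread: the reversal can also be done without `apply` by grouping the reversed series by calendar day instead of by `pd.Grouper`, which side-steps the reversed-DatetimeIndex issue mentioned above. This is only a sketch with random stand-in data, not something tested in the thread against the asker's real dataframe:

import pandas as pd
import numpy as np

# Random stand-in for the asker's frame: two days of minute prices,
# 09:30 through 16:00 (391 minutes per day).
idx = pd.date_range("2017-01-03 09:30", periods=391, freq="min").append(
    pd.date_range("2017-01-04 09:30", periods=391, freq="min")
)
df = pd.DataFrame(
    {"close": np.random.default_rng(0).uniform(29, 30, len(idx))}, index=idx
)

# Reverse the closes, group the reversed series by calendar day
# (index.normalize() maps every timestamp to its midnight), take the
# cumulative max within each day, then reverse back into time order.
rev = df["close"][::-1]
df["reverse_cummax"] = rev.groupby(rev.index.normalize()).cummax()[::-1]
df["profit abs"] = df["reverse_cummax"] - df["close"]
df["profit percentage"] = df["profit abs"] / df["close"]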