Python Pandas - find consecutive group with max aggregate values

Question

I have a data frame with datetimes and integers

import numpy as np
import pandas as pd

df = pd.DataFrame()
df['dt'] = pd.date_range("2017-01-01 12:00", "2017-01-01 12:30", freq="1min")
df['val'] = np.random.choice(range(1, 100), df.shape[0])

Gives me

                    dt  val
0  2017-01-01 12:00:00   33
1  2017-01-01 12:01:00   42
2  2017-01-01 12:02:00   44
3  2017-01-01 12:03:00    6
4  2017-01-01 12:04:00   70
5  2017-01-01 12:05:00   94*
6  2017-01-01 12:06:00   42*
7  2017-01-01 12:07:00   97*
8  2017-01-01 12:08:00   12
9  2017-01-01 12:09:00   11
10 2017-01-01 12:10:00   66
11 2017-01-01 12:11:00   71
12 2017-01-01 12:12:00   25
13 2017-01-01 12:13:00   23
14 2017-01-01 12:14:00   39
15 2017-01-01 12:15:00   25

How can I find which N-minute group of consecutive dt gives me the maximum sum of val?

In this case, if N=3, then the result should be:

                    dt  val
5  2017-01-01 12:05:00   94
6  2017-01-01 12:06:00   42
7  2017-01-01 12:07:00   97

(marked with stars above)

miradulo · Accepted Answer · 2017-02-17T22:59:23.950

7

You could use np.convolve to get the correct starting index and go from there.

def cons_max(df, N):
    max_loc = np.convolve(df.val, np.ones(N, dtype=int), mode='valid').argmax()
    return df.loc[max_loc:max_loc+N-1]

Demo

>>> cons_max(df, 3)
                   dt  val
5 2017-01-01 12:05:00   94
6 2017-01-01 12:06:00   42
7 2017-01-01 12:07:00   97

>>> cons_max(df, 5)
                   dt  val
4 2017-01-01 12:04:00   70
5 2017-01-01 12:05:00   94
6 2017-01-01 12:06:00   42
7 2017-01-01 12:07:00   97
8 2017-01-01 12:08:00   12

This works by effectively "sliding" the kernel (an array of ones) across our input and multiply-accumulating the elements under the window. Since every multiplier is 1, each output value is simply the sum of one window of size N.
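To make the sliding-sum behaviour concrete, here is a minimal sketch that uses the 16 sample values from the question and checks the convolution against an explicit loop over every window. The names `vals`, `window_sums` and `start` are illustrative only and do not appear in the answer.

```python
import numpy as np

# The 16 sample values shown in the question.
vals = np.array([33, 42, 44, 6, 70, 94, 42, 97, 12, 11, 66, 71, 25, 23, 39, 25])
N = 3

# mode='valid' keeps only positions where the kernel fully overlaps the input,
# so there is one output per complete window: len(vals) - N + 1 values.
window_sums = np.convolve(vals, np.ones(N, dtype=int), mode='valid')

# Cross-check against a plain loop that sums each window directly.
expected = np.array([vals[i:i + N].sum() for i in range(len(vals) - N + 1)])
assert (window_sums == expected).all()

# argmax gives the position of the first row of the best window.
start = int(window_sums.argmax())
print(start, window_sums[start])  # prints: 5 233
```

The starting position 5 matches the rows marked with stars in the question (94 + 42 + 97 = 233).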

edited Feb 17 '17 at 22:59

answered Feb 17 '17 at 22:31


thanks. This works quite well for parameterizing `N` – philshem Feb 17 '17 at 22:34
that's actually a very interesting way to do it... could be extended lots of ways... thanks for pointing this out! – Corley Brigman Feb 17 '17 at 22:37
if df.val is a float instead of an int, must I use np.ones(3, dtype=float), or is an int still OK? – philshem Feb 17 '17 at 22:39
@philshem Yes, `int` is still fine - those are just our multipliers. – miradulo Feb 17 '17 at 22:47
@CorleyBrigman Credit to some cookbook I can't recall where I learnt about `convolve`, but you're welcome! – miradulo Feb 17 '17 at 22:47
FWIW - in my real code, convolve() doesn't work with my datetime (`TypeError: Cannot cast array data from dtype(' – philshem Feb 17 '17 at 23:16
@philshem Because a summing op on datetimes is undefined - use the index! – miradulo Feb 18 '17 at 00:30
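One possible reading of that suggestion, sketched under assumptions: keep the datetimes in the index, convolve only the numeric column, and select the window by position. The function name `cons_max_by_position` and its `col` parameter are hypothetical and not part of the answer.

```python
import numpy as np
import pandas as pd

def cons_max_by_position(df, N, col='val'):
    """Return the N consecutive rows whose `col` values have the largest sum."""
    # Only the numeric column is convolved; the datetimes are never summed.
    sums = np.convolve(df[col].to_numpy(), np.ones(N, dtype=int), mode='valid')
    start = int(sums.argmax())
    # iloc selects by position, so any index type (here datetimes) works.
    return df.iloc[start:start + N]

# Sample values from the question, with the datetimes moved into the index.
df = pd.DataFrame(
    {'val': [33, 42, 44, 6, 70, 94, 42, 97, 12, 11, 66, 71, 25, 23, 39, 25]},
    index=pd.date_range("2017-01-01 12:00", periods=16, freq="min"),
)
best = cons_max_by_position(df, 3)
print(best)
```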

score 6 · Answer 2 · edited May 23 '17 at 12:17

You could use rolling/sum and np.nanargmax to find the index associated with the first occurrence of the maximum value:

import numpy as np
import pandas as pd

df = pd.DataFrame({'dt': ['2017-01-01 12:00:00', '2017-01-01 12:01:00', '2017-01-01 12:02:00', '2017-01-01 12:03:00', '2017-01-01 12:04:00', '2017-01-01 12:05:00', '2017-01-01 12:06:00', '2017-01-01 12:07:00', '2017-01-01 12:08:00', '2017-01-01 12:09:00', '2017-01-01 12:10:00', '2017-01-01 12:11:00', '2017-01-01 12:12:00', '2017-01-01 12:13:00', '2017-01-01 12:14:00', '2017-01-01 12:15:00'], 'val': [33, 42, 44, 6, 70, 94, 42, 97, 12, 11, 66, 71, 25, 23, 39, 25]})
df.index = df.index*10

N = 3
idx = df['val'].rolling(window=N).sum()
i = np.nanargmax(idx) + 1
print(df.iloc[i-N : i])

prints

                     dt  val
50  2017-01-01 12:05:00   94
60  2017-01-01 12:06:00   42
70  2017-01-01 12:07:00   97

iloc uses ordinal (position-based) indexing; loc uses label-based indexing. Provided that both i-N and i are valid positions, df.iloc[i-N : i] will grab a window (sub-DataFrame) of length N. In contrast, df.loc[i-N : i] slices by label, so it lines up with the intended window only if the index consists of consecutive integers starting at 0. Note also that loc slices include the end label, so the label-based equivalent would be df.loc[i-N : i-1]. The example above shows a DataFrame where df.loc would not work, since df.index has non-consecutive integer values.
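The difference can be seen in a small sketch that reuses the first eight sample values with the same multiplied-by-10 index. The variable names `by_position` and `by_label` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'val': [33, 42, 44, 6, 70, 94, 42, 97]})
df.index = df.index * 10   # labels are now 0, 10, 20, ..., 70

N, i = 3, 8                # i is one past the last position of the window

# Position-based: rows at positions 5, 6 and 7, regardless of their labels.
by_position = df.iloc[i - N:i]

# Label-based: asks for labels from 5 to 8 inclusive, and no such labels exist.
by_label = df.loc[i - N:i]

print(by_position)
print(len(by_label))
```

With this index the label-based slice comes back empty, while the position-based slice returns the three expected rows.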

FWIW - I can't get `rolling()` to work with either my timestamp (`ops for Rolling for this dtype datetime64[ns] are not implemented`) or my index (`'Int64Index' object has no attribute 'rolling'`). — philshem, Feb 17 '17 at 23:14
`rolling` does not work with `datetime64`s because (for example) summing `datetime64`s is not defined. If you wish to use `rolling` on an integer-valued index, you could use `df.index.to_series().rolling(...)`. — unutbu, Feb 17 '17 at 23:52

Corley Brigman · Answer 3 · 2017-02-17T22:36:26.427

1

For simple single values, you can use something like:

df['total'] = df.val + df.val.shift(-1) + df.val.shift(-2)
first = df.dropna().sort_values('total').index[-1]
df.iloc[first:first+3]

Not sure how to generalize this... with most things pandas, there is probably an easier way, but this does work.

Edit: After a little more work, it looks like rolling is what you want:

last = df.val.rolling(3).sum().dropna().sort_values().index[-1]

This is slightly different, in that the index you get here is the end of the window, so after doing the above you want to do

df.iloc[last-2:last+1]

I think that could be generalized.
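A sketch of one way the rolling approach might be generalized to any window size N. The function name `max_sum_window` and its `col` parameter are hypothetical. It locates the window end by position with np.nanargmax rather than by sorting, so it does not depend on the index labels.

```python
import numpy as np
import pandas as pd

def max_sum_window(df, N, col='val'):
    """Return the N consecutive rows of df whose `col` values sum highest."""
    # The first N-1 rolling sums are NaN because their windows are incomplete.
    rolling_sums = df[col].rolling(N).sum()
    # nanargmax ignores the NaNs and returns the position of the window's last row.
    end = int(np.nanargmax(rolling_sums.to_numpy()))
    return df.iloc[end - N + 1:end + 1]

# Sample values from the question.
df = pd.DataFrame({'val': [33, 42, 44, 6, 70, 94, 42, 97, 12, 11, 66, 71, 25, 23, 39, 25]})
print(max_sum_window(df, 3))   # rows 5 to 7
print(max_sum_window(df, 5))   # rows 4 to 8
```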

edited Feb 17 '17 at 22:36

answered Feb 17 '17 at 22:23


good idea. Is there a way to parameterize the first line, in the case that I wanted N=100 instead of N=3? – philshem Feb 17 '17 at 22:24
