
Apologies for creating what appears to be a duplicate of this question. I have a DataFrame shaped more or less like the one below:

import numpy as np
import pandas as pd

df_length = 240
df = pd.DataFrame(np.random.randn(df_length, 2), columns=['a', 'b'])
df['datetime'] = pd.date_range('2017-06-23', periods=df_length, freq='H')

unique_jobs = ['job1', 'job2', 'job3']
job_id = [unique_jobs for i in range(df_length // len(unique_jobs))]
df['job_id'] = sorted([val for sublist in job_id for val in sublist])

df.set_index(['job_id','datetime'], append=True, inplace=True)

print(df[:5]) returns:

                                     a         b
  job_id datetime                               
0 job1   2017-06-23 00:00:00 -0.067011 -0.516382
1 job1   2017-06-23 01:00:00 -0.174199  0.068693
2 job1   2017-06-23 02:00:00 -1.227568 -0.103878
3 job1   2017-06-23 03:00:00 -0.847565 -0.345161
4 job1   2017-06-23 04:00:00  0.028852  3.111738

I need to resample df['a'] to derive a daily rolling mean, i.e. apply .resample('D').mean().rolling(window=2).mean().
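For reference, on a single job with a plain DatetimeIndex the chain itself is straightforward; the MultiIndex is the only complication. A minimal sketch with illustrative data:

import numpy as np
import pandas as pd

# hourly observations for one hypothetical job
s = pd.Series(np.random.randn(240),
              index=pd.date_range('2017-06-23', periods=240, freq='H'))

daily = s.resample('D').mean()           # hourly -> daily means
smooth = daily.rolling(window=2).mean()  # 2-day rolling mean of the daily means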

I have tried two methods:

1 - unstacking and stacking, as recommended here

df.unstack('job_id','datetime').resample('D').mean().rolling(window=2).mean().stack('job_id', 'datetime')

this raises an error (a working variant of this route is sketched below)
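The unstack/stack route can be made to work once the stray default integer level is dropped, so that unstack('job_id') leaves a plain DatetimeIndex to resample on. A sketch, not verified against the real data:

tmp = df.reset_index(level=0, drop=True)  # index becomes (job_id, datetime)
fixed = (tmp.unstack('job_id')            # job_id moves into the columns
            .resample('D').mean()
            .rolling(window=2).mean()
            .stack('job_id'))             # job_id back into the index

The result comes back indexed (datetime, job_id); .swaplevel().sort_index() restores the original order.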

2 - using pd.Grouper, as recommended here

level_values = df.index.get_level_values
result = df.groupby( [ level_values(i) for i in [0,1] ] + [ pd.Grouper(freq='D', level=2) ] ).mean().rolling(window=2).mean()

this does not raise an error, but it does not group the df as intended: the result seems to contain hourly data points rather than daily ones (a quick check follows the output):

print(result[:5])
                            a         b
  job_id datetime                      
0 job1   2017-06-23       NaN       NaN
1 job1   2017-06-23  0.831609  1.348970
2 job1   2017-06-23 -0.560047  1.063316
3 job1   2017-06-23 -0.641936 -0.199189
4 job1   2017-06-23  0.254402 -0.328190
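As the second answer below pinpoints, level 0 is the untouched default RangeIndex: its values are unique, so every (level-0, job, day) group contains exactly one hourly row. A quick check:

df.index.get_level_values(0).is_unique   # True -> each group holds a single row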
Andreuccio

2 Answers


First, let's define a resampler function:

def resampler(x):
    # index by datetime, take daily means, then a 2-day rolling mean of those
    return x.set_index('datetime').resample('D').mean().rolling(window=2).mean()

Then we group by job_id and apply the resampler function:

df.reset_index(level=2).groupby(level=1).apply(resampler)  # datetime back to a column, group by job_id

Out[657]: 
                          a         b
job_id datetime                      
job1   2017-06-23       NaN       NaN
       2017-06-24  0.053378  0.004727
       2017-06-25  0.265074  0.234081
       2017-06-26  0.192286  0.138148
job2   2017-06-26       NaN       NaN
       2017-06-27 -0.016629 -0.041284
       2017-06-28 -0.028662  0.055399
       2017-06-29  0.113299 -0.204670
job3   2017-06-29       NaN       NaN
       2017-06-30  0.233524 -0.194982
       2017-07-01  0.068839 -0.237573
       2017-07-02 -0.051211 -0.069917

Let me know if this is what you are after.

Allen Qin
  • that does not seem to work for my case. I noticed you suggested dropping the default index (the one with numerals), but that bit of the answer disappeared. Could that be what prevents the groupby from working? – Andreuccio Jun 23 '17 at 11:15
  • I realized we can just ignore the default index. When you run the code, does it work on the example data you provided? – Allen Qin Jun 23 '17 at 11:21

IIUC, you wish to group by job_id and (daily) datetimes, and wish to ignore the first level of the DataFrame index. Therefore, instead of grouping by

( [ level_values(i) for i in [0,1] ] + [ pd.Grouper(freq='D', level=2) ] )

you'd want to group by

[df.index.get_level_values(1), pd.Grouper(freq='D', level=2)]

import numpy as np
import pandas as pd
np.random.seed(2017)

df_length = 240
df = pd.DataFrame(np.random.randn(df_length, 2), columns=['a', 'b'])
df['datetime'] = pd.date_range('2017-06-23', periods=df_length, freq='H')

unique_jobs = ['job1', 'job2', 'job3']
job_id = [unique_jobs for i in range(df_length // len(unique_jobs))]
df['job_id'] = sorted([val for sublist in job_id for val in sublist])

df.set_index(['job_id', 'datetime'], append=True, inplace=True)

# group by job_id (level 1) and daily buckets of datetime (level 2),
# deliberately leaving out the default integer index (level 0)
grouped = df.groupby([df.index.get_level_values(1), pd.Grouper(freq='D', level=2)])
result = grouped.mean().rolling(window=2).mean()

print(result)

yields

                          a         b
job_id datetime                      
job1   2017-06-23       NaN       NaN
       2017-06-24 -0.203083  0.176141
       2017-06-25 -0.077083  0.072510
       2017-06-26 -0.237611 -0.493329
job2   2017-06-26 -0.297775 -0.370543
       2017-06-27  0.005124  0.052603
       2017-06-28  0.226142 -0.015584
       2017-06-29 -0.065595  0.210628
job3   2017-06-29 -0.186865  0.347683
       2017-06-30  0.051508  0.029909
       2017-07-01  0.005341  0.075378
       2017-07-02 -0.027131  0.132192
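
One caveat, visible in the output above: the rolling window runs over the concatenated daily means and therefore crosses job boundaries (job2's first day, 2017-06-26, blends job1's last daily mean with job2's first, whereas the first answer restarts the window with NaN at each job). If the window should not cross jobs, a sketch of a per-group variant, assuming a reasonably recent pandas:

daily = grouped.mean()
result = (daily.groupby(level='job_id')
               .rolling(window=2).mean()
               .droplevel(0))  # groupby().rolling() prepends a duplicate job_id level; drop it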
unutbu
  • thank you very much, your solution also worked. I have just realised how tricky it would be for me to re-arrange all the data manipulation using multi-indexes, so I will attempt to `slice` the main df into many, i.e. one for each value of `job_id`. Not quite sure how to do it, possibly using a dictionary? I'll post a separate question – Andreuccio Jun 23 '17 at 15:12
  • @Andreuccio: Vectorized operations such as those provided by NumPy and Pandas perform best when applied to a single large array or NDFrame (rather than multiple calls on smaller arrays or NDFrames). Therefore, for the sake of performance it is usually better to keep one large DataFrame with a MultiIndex rather than a dict of smaller DataFrames (a selection sketch follows this thread). If performance is a priority, you may want to pose a question about how to deal with the MultiIndex... – unutbu Jun 23 '17 at 15:24
  • I totally get your rationale, but the problem is I am spending days trying to perform the data analysis with multi-index, so I am not really bothered about a couple of minutes more of cpu time! Have posted the question here: https://stackoverflow.com/questions/44725105/sub-select-a-multi-index-pandas-dataframe-to-create-multiple-subsets-using-a-di – Andreuccio Jun 23 '17 at 15:31
  • can you please open a chatroom with me? – Andreuccio Jun 23 '17 at 18:02
  • Sorry, this is not a good time for me, but if you leave a comment I'll try to get back to you. – unutbu Jun 23 '17 at 20:36
  • are you available today? – Andreuccio Jun 30 '17 at 13:40
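
As a footnote to the comment thread above: one job's rows can be pulled from the (job_id, datetime) MultiIndex without splitting the frame into a dict. A selection sketch, using the result from this answer:

job1 = result.xs('job1', level='job_id')  # selects one job, dropping the job_id level

# or, keeping the job_id level in place:
idx = pd.IndexSlice
job1_kept = result.loc[idx['job1', :], :]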