pad a data frame according to a frequency for each group

Question

I have a pandas.DataFrame df with a pandas.DatetimeIndex and a column named group_column. I need the df to have a minutely frequency (meaning there is a row for every minute).

however this needs to be case for every value in the group_column, so every minute can potentially have several values.

NOTE:

the group_column can have hundreds of unique values.
some groups can "last" several minutes and others can last for days, the edges are determined by the first and last appearances of the values in group_column.

example

Input:

dates = [pd.Timestamp('2018-01-01 12:00'), pd.Timestamp('2018-01-01 12:01'), pd.Timestamp('2018-01-01 12:01'), pd.Timestamp('2018-01-01 12:03'), pd.Timestamp('2018-01-01 12:04')]
df = pd.DataFrame({'group_column': ['a', 'a','b','a','b'], 'data_column': [1.2, 2.2, 4, 1, 2]}, index=dates)

                    group_column        data_column
2018-01-01 12:00:00      a                  1.2
2018-01-01 12:01:00      a                  2.2
2018-01-01 12:01:00      b                  4.0
2018-01-01 12:03:00      a                  1.0
2018-01-01 12:04:00      b                  2.0

desired output:

                    group_column    data_column
2018-01-01 12:00:00      a              1.2
2018-01-01 12:01:00      a              2.2
2018-01-01 12:02:00      a              2.2
2018-01-01 12:03:00      a              1.0
2018-01-01 12:01:00      b              4.0
2018-01-01 12:02:00      b              4.0
2018-01-01 12:03:00      b              4.0
2018-01-01 12:04:00      b              2.0

my attempt

I have done this, however it seems highly inefficient:

def group_resmaple(df, group_column_name):
    values = df[group_column_name].unique()
    for value in values:
        df_g = df.loc[df[group_column]==value]
        df_g = df_g.asfreq('min', 'pad')
        yield df_g

df_paded = pd.concat(group_resmaple(df, 'group_column'))

I think you should change `group_column` to be part of the index (`df.set_index('group_column', inplace=True, drop=False)`), then try to figure out how to make resampling work, either (ideally) on the whole DataFrame, or perhaps on one group at a time. See https://stackoverflow.com/questions/15799162/resampling-within-a-pandas-multiindex — John Zwinck, Nov 11 '18 at 10:22

score 1 · Accepted Answer · answered Nov 11 '18 at 10:22

Use GroupBy.apply with asfreq:

df1 = (df.groupby('group_column')
         .apply(lambda x: x.asfreq('min', 'pad'))
         .reset_index(level=0, drop=True))
print (df1)
                    group_column  data_column
2018-01-01 12:00:00            a          1.2
2018-01-01 12:01:00            a          2.2
2018-01-01 12:02:00            a          2.2
2018-01-01 12:03:00            a          1.0
2018-01-01 12:01:00            b          4.0
2018-01-01 12:02:00            b          4.0
2018-01-01 12:03:00            b          4.0
2018-01-01 12:04:00            b          2.0

score 1 · Answer 2 · answered Nov 11 '18 at 15:43

1

My approach would be

df2 = df.groupby('group_column').resample('min').ffill().reset_index(level=0, drop=True)
print(df2)

                     data_column group_column
2018-01-01 12:00:00          1.2            a               
2018-01-01 12:01:00          2.2            a               
2018-01-01 12:02:00          2.2            a               
2018-01-01 12:03:00          1.0            a               
2018-01-01 12:01:00          4.0            b               
2018-01-01 12:02:00          4.0            b               
2018-01-01 12:03:00          4.0            b               
2018-01-01 12:04:00          2.0            b

answered Nov 11 '18 at 15:43

SpghttCd

10,510
2
20
25

why is this better then jezrael answer ? faster ? – moshevi Nov 11 '18 at 16:38
I didn't mean to have a better solution, it's just mine in the first place. I didn't time it and normally from my experience here it's generally a good idea to examine whatever @jezrael is posting, at least with regards to pandas. However, in this case he uses `apply` which my approach does not, so it might be worth comparing performance... – SpghttCd Nov 11 '18 at 16:48
In the meantime I was able to timeit and the code of @jezrael is about 25% faster. I always thought _"no apply is better"_, but obviously this is wrong. Now _why_ should perhaps better asked to jezrael... – SpghttCd Nov 11 '18 at 22:12

pad a data frame according to a frequency for each group

2 Answers2