8

With pandas.DataFrame.resample I can downsample a DataFrame into a certain time duration:

df.resample("3s").mean()

However, I do not want to specify a certain time, but rather a fixed number of rows in the original data frame, e.g. "resample such that three rows previously are now aggregated into one". How's that possible in pandas?

knub
  • 3,892
  • 8
  • 38
  • 63
  • 1
    Something like this? https://stackoverflow.com/questions/36810595/calculate-average-of-every-x-rows-in-a-table-and-create-new-table/36810658#36810658 – ayhan Jun 01 '17 at 11:05
  • 2
    Do you need `df = df.groupby(np.arange(len(df.index)) // 3).mean()` ? – jezrael Jun 01 '17 at 11:05
  • @jezrael: I was hoping for sth. more idiomatic, less hacky. – knub Jun 01 '17 at 11:09
  • @knub, AFAIK the solution proposed by ayhan and jezrael is currently the most idiomatic way to achieve that what you want – MaxU - stand with Ukraine Jun 01 '17 at 11:26
  • I mostly use MultiIndexes in similar situations. I keep one index equivalent of `np.arange(len(df.index)) // 3` and then group using that level (i.e. `df.groupby(level=something)`) But other than that, I don't believe pandas offers a built in method to solve this. – ayhan Jun 01 '17 at 12:15

1 Answers1

5

It might be a bit late, but here is my answer for everyone searching for a solution to this problem.

One solution would be to use pandas rolling(n) sliding window functionality and then select every nth value. e.G. for n=3

df_sub = df.rolling(3).mean()[::3]

this is a bit wasteful for calculation, since you recalculate the whole dataframe and then just keep 1/n percent of it.

Another similar approach to the problem, wich is not calculating the mean, but instead interpolating the whole dataframe column wise would be to use numpy's interp1 function.

e.G.: Assuming you have a DataFrame, where the indices are are monotonically increasing numerical/timestamped values (as usually with time series data), and you want to adjust every column individually you could do it like this:

def resample_fixed(df, n_new):
    n_old, m = df.values.shape
    mat_old = df.values
    mat_new = np.zeros((n_new, m))
    x_old = np.linspace(df.index.min(), df.index.max(), n_old)
    x_new = np.linspace(df.index.min(), df.index.max(), n_new)

    for j in range(m):
        y_old = mat_old[:, j]
        y_new = np.interp(x_new, x_old, y_old)
        mat_new[:, j] = y_new

    return pd.DataFrame(mat_new, index=x_new, columns=df.columns)

be careful though, interp1 does alter your data, since it linearly interpolates your datapoints. I would recommend inspecting the result after interpolation.

You can find a full example on the interpolation in a gist file I did for that here.

Tobi
  • 51
  • 1
  • 5