
I'm trying to downsample Dask dataframes by an arbitrary number of rows.

For instance, if I was using datetimes as an index, I could just use:

df = df.resample('1h').ohlc()

But I don't want to resample by datetimes, I want to resample by a fixed number of rows...something like:

df = df.resample(rows=100).ohlc()

I did a bunch of searching and found these three old SO pages:

  • This one suggests:
    • df.groupby(np.arange(len(df))//x), where x = the number of rows.
    • pd.DataFrame(df.values.reshape(-1,2,df.shape[1]).mean(1)), but I have trouble understanding this one.
    • pd.DataFrame(np.einsum('ijk->ik',df.values.reshape(-1,2,df.shape[1]))/2.0), but I also have trouble understanding this one.
  • This one suggests df.groupby(np.arange(len(df))//x) again.
  • This one suggests df_sub = df.rolling(x).mean()[::x], but it says it's wasteful, and doesn't seem optimized for Dask.
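For reference, here's a minimal pandas sketch (toy data, x = 2) showing that the first three suggestions all compute the same per-bucket mean. The reshape trick stacks each run of x consecutive rows along a new middle axis and averages over it, and the einsum version sums over that same axis and divides by x; both only work when the row count is an exact multiple of x:

```python
import numpy as np
import pandas as pd

# Toy frame: 6 rows, 2 columns, default integer index.
df = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6],
                   "b": [10, 20, 30, 40, 50, 60]})
x = 2  # rows per bucket

# 1) Label every x consecutive rows with the same group key, then average.
g = df.groupby(np.arange(len(df)) // x).mean()

# 2) reshape(-1, x, n_cols) stacks each bucket of x rows along axis 1;
#    .mean(1) then averages within each bucket.
r = pd.DataFrame(df.values.reshape(-1, x, df.shape[1]).mean(1),
                 columns=df.columns)

# 3) einsum 'ijk->ik' sums over axis 1 (the bucket axis); dividing by x
#    turns that sum into the same per-bucket mean.
e = pd.DataFrame(np.einsum("ijk->ik",
                           df.values.reshape(-1, x, df.shape[1])) / x,
                 columns=df.columns)

assert (g.values == r.values).all() and (r.values == e.values).all()
```

(The SO answers hard-code 2 where x appears here; the shape logic is the same.)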

The best, fastest option seems to be df.groupby(np.arange(len(df))//x), and it works fine in Pandas. However, when I try it in Dask, I get: ValueError: Grouper and axis must be same length

How do I resample by # of rows using Dask?

I have dataframes with:

  • A standard index (e.g. 1,2,3,4,5...,n)
  • Datetime values I could potentially use as an index (although I don't necessarily want to)
  • Non-standard lengths (i.e. some have an even number of rows and some have an odd number).
