I'm trying to downsample Dask dataframes by an arbitrary number of rows, x.
For instance, if I was using datetimes as an index, I could just use:

`df = df.resample('1h').ohlc()`

But I don't want to resample by datetimes; I want to resample by a fixed number of rows, something like:

`df = df.resample(rows=100).ohlc()`
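(There's no actual `rows=` keyword; that's just the behavior I'm after.) To make the target concrete, here's a toy pandas frame where the time-based version works; what I want is the same kind of output driven by row count instead of time:

```python
import pandas as pd

idx = pd.date_range("2021-01-01", periods=6, freq="30min")
df = pd.DataFrame({"price": [1.0, 3.0, 2.0, 5.0, 4.0, 6.0]}, index=idx)

# Time-based resampling: one OHLC row per hour (two 30-minute rows each)
print(df.resample("1h").ohlc())

# What I want instead: one OHLC row per fixed chunk of rows
# (e.g. every 100 rows), independent of any datetime index
```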
I did a bunch of searching and found these three old SO pages:

- This one suggests `df.groupby(np.arange(len(df))//x)`, where x is the number of rows. It also offers `pd.DataFrame(df.values.reshape(-1,2,df.shape[1]).mean(1))` and `pd.DataFrame(np.einsum('ijk->ik',df.values.reshape(-1,2,df.shape[1]))/2.0)`, but I have trouble understanding both of those (see the toy example after this list for my best guess at what they do).
- This one suggests `df.groupby(np.arange(len(df))//x)` again.
- This one suggests `df_sub = df.rolling(x).mean()[::x]`, but says it's wasteful, and it doesn't seem optimized for Dask anyway.
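For what it's worth, here's my best guess at the two reshape-based snippets, tried on a toy frame. As far as I can tell, both average consecutive pairs of rows, so they're hard-coded to x = 2, they produce a plain mean rather than OHLC, and they break on an odd row count (which some of my dataframes have):

```python
import numpy as np
import pandas as pd

# 6 rows x 2 columns, values 0..11
df = pd.DataFrame(np.arange(12).reshape(6, 2), columns=["a", "b"])

# reshape(-1, 2, n_cols) stacks consecutive pairs of rows, then .mean(1)
# averages within each pair -> one output row per 2 input rows
paired_mean = pd.DataFrame(df.values.reshape(-1, 2, df.shape[1]).mean(1))

# the einsum sums each pair over the middle axis and divides by 2,
# which is the same pairwise mean
einsum_mean = pd.DataFrame(np.einsum("ijk->ik", df.values.reshape(-1, 2, df.shape[1])) / 2.0)

print(paired_mean.equals(einsum_mean))  # True: both collapse 6 rows to 3
```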
The best, fastest option seems to be `df.groupby(np.arange(len(df))//x)`, and it works fine in Pandas. However, when I try it in Dask, I get:

`ValueError: Grouper and axis must be same length`
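Here's a minimal reproduction of what I'm running (toy data standing in for my real frames):

```python
import numpy as np
import pandas as pd
import dask.dataframe as dd

x = 100
pdf = pd.DataFrame({"price": np.random.rand(1000)})  # default integer index

# Pandas: label rows with chunk ids 0,0,...,1,1,... then OHLC per chunk -- works
pandas_ohlc = pdf.groupby(np.arange(len(pdf)) // x).ohlc()
print(pandas_ohlc.shape)  # (10, 4): one OHLC row per 100-row chunk

# Dask: the same grouper array against a partitioned frame
ddf = dd.from_pandas(pdf, npartitions=4)
ddf.groupby(np.arange(len(ddf)) // x)  # <- ValueError: Grouper and axis must be same length
```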
How do I resample by # of rows using Dask?
I have dataframes with:
- A standard integer index (e.g. 1, 2, 3, 4, 5, ..., n)
- Datetime values I could potentially use as an index (although I don't necessarily want to)
- Varying lengths (i.e. some have an even number of rows, and some have an odd number)
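In case the data shape matters, here's roughly what one of these dataframes looks like (column names invented for illustration):

```python
import numpy as np
import pandas as pd
import dask.dataframe as dd

n = 1001  # lengths vary; some frames are even, some odd
pdf = pd.DataFrame({
    "timestamp": pd.date_range("2021-01-01", periods=n, freq="s"),  # datetimes exist, but aren't the index
    "price": np.random.rand(n),
})  # default integer index 0, 1, 2, ..., n-1
ddf = dd.from_pandas(pdf, npartitions=8)
```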