
This is a follow-on question from Subsetting Dask DataFrames. I wish to shuffle data from a dask dataframe before sending it in batches to an ML algorithm.

The answer in that question was to do the following:

for part in df.repartition(npartitions=100).to_delayed():
    batch = part.compute()

However, even if I were to shuffle the contents of batch, I'm a bit worried that it might not be ideal. The data is a time series, so datapoints would be highly correlated within each partition.

What I would ideally like is something along the lines of:

rand_idx = np.random.choice(len(df), batch_size, replace=False)
batch = df.iloc[rand_idx, :]

which would work on pandas but not dask. Any thoughts?

Edit 1: Potential Solution

I tried doing

import numpy as np

len_df = len(df)  # on a dask dataframe, len() triggers a computation
train_len = int(len_df * 0.8)
idx = np.random.permutation(len_df)
train_idx = idx[:train_len]
test_idx = idx[train_len:]
train_df = df.loc[train_idx]
test_df = df.loc[test_idx]

However, if I try doing train_df.loc[:5, :].compute(), this returns a 124,451-row dataframe, so I am clearly using dask wrong.

sachinruk

4 Answers


I recommend adding a column of random data to your dataframe and then using that to set the index:

df = df.map_partitions(add_random_column_to_pandas_dataframe, ...)
df = df.set_index('name-of-random-column')
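For completeness, a minimal sketch of what that helper might look like. The function body and the column name shuffle_key are my own assumptions, not from the original answer:

import numpy as np

def add_random_column_to_pandas_dataframe(pdf):
    # One uniform random value per row: sorting on this column during
    # set_index shuffles rows across the whole dataframe.
    return pdf.assign(shuffle_key=np.random.random(len(pdf)))

df = df.map_partitions(add_random_column_to_pandas_dataframe)
df = df.set_index('shuffle_key')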
MRocklin
    Could I not have simply done `df['rand_index'] = np.random.permutation(len(df))` for the first line? – sachinruk Oct 24 '17 at 02:42
  • Seems like this behaviour should be built in. So we can automatically add a random column that isn't used and then shuffle the entire dataframe. Shuffling each individual partition is not enough, one must shuffle across the entire dataframe. – CMCDragonkai Oct 28 '19 at 06:10
  • Related to this solution: https://stackoverflow.com/questions/46923274/appending-new-column-to-dask-dataframe?rq=1 – Wilem2 Dec 31 '20 at 16:29

I encountered the same issue recently and came up with a different approach using dask array and the shuffle_slice function introduced in this pull request.

It shuffles the whole dataset:

import numpy as np
from dask.array.slicing import shuffle_slice

d_arr = df.to_dask_array(lengths=True)  # compute chunk lengths so the array can be sliced
df_len = len(df)
np.random.seed(42)
index = np.random.choice(df_len, df_len, replace=False)  # a random permutation of all rows
d_arr = shuffle_slice(d_arr, index)

and to transform it back to a dask dataframe:

df = d_arr.to_dask_dataframe(columns=df.columns)

For me it works well, even for large data sets.
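Tying this back to the question, the shuffled frame can then be consumed in batches exactly as in the original loop; each partition now holds globally shuffled rows rather than a contiguous time slice (npartitions=100 is simply the value used in the question):

for part in df.repartition(npartitions=100).to_delayed():
    batch = part.compute()  # a pandas DataFrame of shuffled rows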

B--rian

If you're trying to split your dataframe into training and testing subsets, that is exactly what sklearn.model_selection.train_test_split does, and it works with pandas.DataFrame. (See there for an example.)

And for your case of using it with dask, you may be interested in the library dklearn, which seems to implement this function.

To do that, we can use the train_test_split function, which mirrors the scikit-learn function of the same name. We'll hold back 20% of the rows:

from dklearn.cross_validation import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

More information here.

Note: I did not perform any tests with dklearn; this is just something I came across, but I hope it can help.


EDIT: what about dask.DataFrame.random_split?

Examples

50/50 split

>>> a, b = df.random_split([0.5, 0.5])

80/10/10 split, consistent random_state

>>> a, b, c = df.random_split([0.8, 0.1, 0.1], random_state=123)
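Applied to the 80/20 train/test scenario from the question's edit, it would look something like this (the random_state value is an arbitrary choice):

>>> train_df, test_df = df.random_split([0.8, 0.2], random_state=42)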

Its use for ML applications is illustrated here.

FabienP
  • Hi Fabien, I'm afraid dklearn doesn't exist anymore. I actually saw that blogpost but couldn't get much out of it for that reason. – sachinruk Nov 01 '17 at 00:39
  • @Sachin_ruk, I think I found another solution (built into `dask.DataFrame`), see my edit. – FabienP Nov 03 '17 at 15:06

For people who come here really just wanting to shuffle the rows, as the title implies: note that this is costly, because set_index has to shuffle the full dataframe:

import numpy as np
random_idx = np.random.permutation(len(sd.index))
sd = sd.assign(random_idx=random_idx)  # assign returns a new dataframe
sd = sd.set_index('random_idx')  # triggers the (expensive) full shuffle

DsCpp