This is a follow-up question to Shuffling data in dask.
I have an existing dask dataframe df where I wish to do the following:
df['rand_index'] = np.random.permutation(len(df))
However, this gives the error "Column assignment doesn't support type ndarray". I tried to use df.assign(rand_index=np.random.permutation(len(df))), which gives the same error.
Here is a minimal (not) working sample:
import pandas as pd
import dask.dataframe as dd
import numpy as np
df = dd.from_pandas(pd.DataFrame({'A':[1,2,3]*10, 'B':[3,2,1]*10}), npartitions=10)
df['rand_index'] = np.random.permutation(len(df))
Note:
The previous question mentioned using df = df.map_partitions(add_random_column_to_pandas_dataframe, ...), but I'm not sure whether that is relevant to this particular case.
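For reference, here is a rough sketch of what I imagine that per-partition approach looks like (the body of the helper is my own guess; the previous question only names it). If I understand it correctly, it would permute rows within each partition rather than across the whole dataframe, which is partly why I'm unsure it applies here:
import numpy as np
import pandas as pd
import dask.dataframe as dd

def add_random_column_to_pandas_dataframe(pdf):
    # pdf is one pandas partition; assign a permutation of its own length
    pdf = pdf.copy()
    pdf['rand_index'] = np.random.permutation(len(pdf))
    return pdf

df = dd.from_pandas(pd.DataFrame({'A': [1, 2, 3] * 10, 'B': [3, 2, 1] * 10}), npartitions=10)
df = df.map_partitions(add_random_column_to_pandas_dataframe)
print(df.head())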
Edit 1
I attempted
df['rand_index'] = dd.from_array(np.random.permutation(len(df)))
which executed without issue. When I inspected df.head(), the new column appeared to have been created just fine. However, when I looked at df.tail(), the rand_index column is a bunch of NaNs.
In fact, just to confirm, I checked df.rand_index.max().compute(), which turned out to be smaller than len(df)-1. So this is probably where df.map_partitions comes into play, as I suspect the issue comes from the dataframe being partitioned. In my actual case (not the sample above) I have 80 partitions.