This is a follow-up question to Shuffling data in dask.
I have an existing dask dataframe df where I wish to do the following:
df['rand_index'] = np.random.permutation(len(df))
However, this gives the error "Column assignment doesn't support type ndarray". I tried to use df.assign(rand_index=np.random.permutation(len(df))), which gives the same error.
Here is a minimal (not) working sample:
import pandas as pd
import dask.dataframe as dd
import numpy as np
df = dd.from_pandas(pd.DataFrame({'A':[1,2,3]*10, 'B':[3,2,1]*10}), npartitions=10)
df['rand_index'] = np.random.permutation(len(df))
Note:
The previous question mentioned using df = df.map_partitions(add_random_column_to_pandas_dataframe, ...), but I'm not sure whether that is relevant to this particular case.
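For reference, here is a rough sketch of what I imagine that per-partition approach looks like (the body of the helper is my own guess; the previous question only names it). If I understand it correctly, it would permute rows within each partition rather than across the whole dataframe, which is partly why I'm unsure it applies here:
import numpy as np
import pandas as pd
import dask.dataframe as dd

def add_random_column_to_pandas_dataframe(pdf):
    # pdf is one pandas partition; assign a permutation of its own length
    pdf = pdf.copy()
    pdf['rand_index'] = np.random.permutation(len(pdf))
    return pdf

df = dd.from_pandas(pd.DataFrame({'A': [1, 2, 3] * 10, 'B': [3, 2, 1] * 10}), npartitions=10)
df = df.map_partitions(add_random_column_to_pandas_dataframe)
print(df.head())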
Edit 1
I attempted
df['rand_index'] = dd.from_array(np.random.permutation(len(df)))
which executed without issue. When I inspected df.head(), the new column appeared to have been created just fine. However, when I looked at df.tail(), the rand_index column is a bunch of NaNs.
In fact, just to confirm, I checked df.rand_index.max().compute(), which turned out to be smaller than len(df)-1. So this is probably where df.map_partitions comes into play, as I suspect the issue comes from the dataframe being partitioned. In my actual case (not the sample above) I have 80 partitions.