
I have a dask dataframe and a dask array with the same number of rows, in the same logical order. The dataframe rows are indexed by strings. I am trying to add one of the array columns to the dataframe. I have tried several approaches, each of which failed in its own way.

df['col'] = da.col
# TypeError: Column assignment doesn't support type Array

df['col'] = da.to_frame(columns='col')
# TypeError: '<' not supported between instances of 'str' and 'int'

df['col'] = da.to_frame(columns=['col']).set_index(df.col).col
# TypeError: '<' not supported between instances of 'str' and 'int'

df = df.reset_index()
df['col'] = da.to_frame(columns='col')
# ValueError: Not all divisions are known, can't align partitions. Please use `set_index` to set the index.

and a few other variants.

What is the right way to add a dask array column to a dask dataframe when the structures are logically compatible?

Ravi Garg
Daniel Mahler
  • Looks like a duplicate. See [here](https://stackoverflow.com/a/46951629/4077912) – Primer Jan 09 '18 at 07:19
  • 1
    @Primer This is about a dask.array column. The other question is about adding a numpy.array column. I have figured that out, and for small data I can do `da.compute()` and use that, but I want to avoid the `da.compute()`. – Daniel Mahler Jan 09 '18 at 08:13
  • I think `TypeError: Column assignment doesn't support type ...` is the common denominator in this case, not the type of the data itself. Was hoping this would help you. – Primer Jan 09 '18 at 08:36
  • It is not a duplicate of [here](https://stackoverflow.com/questions/46923274/appending-new-column-to-dask-dataframe/46951629#46951629). The question here is how to add a dask array, not a numpy array. – jeromerg Dec 04 '18 at 09:13
  • 1
    @DanielMahler Have you found a solution? I have the same problem, resulting of using dask-ml `KMeans`: `KMeans.fit()` returns a dask array of the cluster-labels, which I would like to integrate back to the source DataFrame – jeromerg Dec 04 '18 at 09:16
  • @jeromerg Just wondering if you found the solution. I have the exact same situation here. – Andrey Kuehlkamp Jul 11 '19 at 01:12

2 Answers


The solution is to pull the index column of the original Dask dataframe out as a plain pandas dataframe, add the Dask array column to it, and then merge it back into the Dask dataframe on the index column:

index_col = df[['index']].compute()      # plain pandas DataFrame holding the key column
index_col['new_col'] = da.col.compute()  # computed array values
df = df.merge(index_col, how='left', on='index')
Madcat

Direct column assignment does seem to work as of dask version 2021.4.0, and possibly earlier. Just make sure the number of dataframe partitions matches the number of array chunks.

import dask.array as da
import dask.dataframe as dd
import numpy as np
import pandas as pd
ddf = dd.from_pandas(pd.DataFrame({'z': np.arange(100, 104)}),
                     npartitions=2)
ddf['a'] = da.arange(200, 204, chunks=2)
print(ddf.compute())

Output:

     z    a
0  100  200
1  101  201
2  102  202
3  103  203
HoosierDaddy