I have a column in a dask DataFrame that contains comma-separated lists of different categories. I'm looking to replicate the functionality of sklearn's MultiLabelBinarizer or the pandas method Series.str.get_dummies(','), exactly as this thread describes: Create dummies from column with multiple values in dask

Is there really no way to do this, as the one answer there states? Could it be implemented if I had a list of all of the values?

SultanOrazbayev
jxo

1 Answer

If the list of all classes is known, then it's an easy task for dask:

import dask.dataframe as dd
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame({"col_a": ["c, d", "e", "g", "e, g", "d, e"]})
all_classes = ["c", "d", "e", "g"]
mlb = MultiLabelBinarizer(classes=all_classes)

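def myfunc(df):
    # Split each comma-separated string into a list of stripped labels first;
    # passing raw strings makes the binarizer iterate over single characters.
    labels = [[label.strip() for label in s.split(",")] for s in df["col_a"]]
    # Keep the partition's index so the result stays aligned with the input.
    return pd.DataFrame(mlb.fit_transform(labels), columns=all_classes, index=df.index)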

ddf = dd.from_pandas(df, npartitions=2)

ddf.map_partitions(myfunc, meta=pd.DataFrame(columns=all_classes)).compute()

If the list is not known, then one option is to make a first pass through the dataframe to collect all unique values, and then pass those values as the classes in a snippet similar to the one above.

SultanOrazbayev