
I have this DataFrame:

classification       text      apple  banana  peach   grape
["apple","grape"]    anytext    NaN     NaN    NaN     NaN

How can I check, for each fruit column, whether its name appears in the list in the classification column, to get this:

classification        text      apple  banana  peach   grape
["apple","grape"]    anytext      1      0       0       1

Data:

{'classification': [['apple', 'grape']],
 'text': ['anytext'],
 'apple': [nan],
 'banana': [nan],
 'peach': [nan],
 'grape': [nan]}
pouchewar
  • [MultiLabelBinarizer](https://stackoverflow.com/a/51420716/4983450) would be pretty efficient on this. – Psidom Feb 26 '22 at 01:10

1 Answer

You could apply a lambda function to "classification" that checks, for each column name, whether it appears in that row's list:

cols = ['apple','banana','peach','grape']
df[cols] = df['classification'].apply(lambda x: [1 if col in x else 0 for col in cols]).tolist()
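For reference, here is a minimal self-contained sketch of the same membership check, built from the data in the question. It uses a plain list comprehension instead of apply, and the intermediate name flags is only illustrative:

```python
import numpy as np
import pandas as pd

# DataFrame rebuilt from the "Data" dict in the question
df = pd.DataFrame({
    'classification': [['apple', 'grape']],
    'text': ['anytext'],
    'apple': [np.nan],
    'banana': [np.nan],
    'peach': [np.nan],
    'grape': [np.nan],
})
cols = ['apple', 'banana', 'peach', 'grape']

# One row of 0/1 flags per DataFrame row: 1 if the column name is in that row's list
flags = [[int(col in labels) for col in cols] for labels in df['classification']]

# Wrap the flags in a DataFrame with the same index so the assignment aligns row by row
df[cols] = pd.DataFrame(flags, index=df.index, columns=cols)
print(df)
```

Because the columns are replaced outright rather than filled, they should come out as integers rather than floats.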

Another option is to explode "classification", move the exploded values into the index, and fillna + stack the target columns. This gives a placeholder Series whose MultiIndex has three levels: the original index, the "classification" item, and the column name. Comparing the last two levels marks where an item matches a column name; build a Series from that comparison, then unstack + groupby + sum to get a DataFrame to assign back to df:

tmp = df.explode('classification')
s = tmp.set_index([tmp.index, tmp['classification']])[cols].fillna(0).stack()
s = pd.Series((s.index.get_level_values(1)==s.index.get_level_values(2)).astype(int), index=s.index)
df[cols] = s.unstack().groupby(level=0).sum()

An even simpler option is to use explode + pd.get_dummies + groupby + sum to turn the items in "classification" into dummy variables, then fill the NaN values in df with them using fillna (the second fillna sets columns that never appear in any list to 0):

df[cols] = df[cols].fillna(pd.get_dummies(df['classification'].explode()).groupby(level=0).sum()).fillna(0)
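A variant of the get_dummies approach, sketched under the assumption that the target columns can simply be overwritten: reindex takes the place of the two fillna calls, adding all-zero columns for names that never occur and dropping labels that are not in cols. The name dummies is only illustrative:

```python
import numpy as np
import pandas as pd

# DataFrame rebuilt from the "Data" dict in the question
df = pd.DataFrame({
    'classification': [['apple', 'grape']],
    'text': ['anytext'],
    'apple': [np.nan],
    'banana': [np.nan],
    'peach': [np.nan],
    'grape': [np.nan],
})
cols = ['apple', 'banana', 'peach', 'grape']

dummies = (
    pd.get_dummies(df['classification'].explode())  # one indicator row per list item
    .groupby(level=0).sum()                         # collapse back to one row per original index
    .reindex(columns=cols, fill_value=0)            # add missing columns as 0, drop unknown labels
    .astype(int)
)
df[cols] = dummies
print(df)
```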

Output:

   classification     text  apple  banana  peach  grape
0  [apple, grape]  anytext      1       0      0      1
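
The comment under the question suggests MultiLabelBinarizer. A minimal sketch of that route, assuming scikit-learn is installed; passing classes=cols is a choice made here so that the output columns come out in the same order as cols:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# DataFrame rebuilt from the "Data" dict in the question
df = pd.DataFrame({
    'classification': [['apple', 'grape']],
    'text': ['anytext'],
    'apple': [np.nan],
    'banana': [np.nan],
    'peach': [np.nan],
    'grape': [np.nan],
})
cols = ['apple', 'banana', 'peach', 'grape']

# classes=cols fixes the column order of the indicator matrix
mlb = MultiLabelBinarizer(classes=cols)

# fit_transform returns a 2-D 0/1 array with one row per list and one column per class
df[cols] = mlb.fit_transform(df['classification'])
print(df)
```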