
I have a dataset of movie genres. The genres in each row are separated by a space (" "):

id   genres
0   drama romance
1   drama
2   comedy
3   mystery thriller
4   crime thriller
...

I want to split them into one indicator column per genre (there are about 20 genres):

id   drama romance comedy...
0     1      1      0
1     1      0      0
2     0      0      1
3     0      0      0
4     0      0      0
...

I was thinking about using `get_dummies`, but I don't think it can help.

NKG
  • Take a look at `split`, `explode` and `crosstab` or `get_dummies` – rafaelc Oct 27 '20 at 13:14
  • `pd.get_dummies(df.set_index('id'))` would be the simplest solution imo. – Umar.H Oct 27 '20 at 13:16
  • Use `df.join(df.pop('genres').str.get_dummies(' '))` – jezrael Oct 27 '20 at 13:23
  • It looks like neither of the linked answers fully answers your question because your df contains a space-separated list of genres. Riccardo's answer is good. Both the other examples have multiple columns. This should not have been closed. – B. Bogart Oct 27 '20 at 13:44
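
The `str.get_dummies` idea from the comments can be sketched as follows. This is a minimal example that assumes the column is named `genres` as in the question and that a single space is the only separator. The sample frame is rebuilt here only so the snippet runs on its own.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "id": [0, 1, 2, 3, 4],
    "genres": ["drama romance", "drama", "comedy",
               "mystery thriller", "crime thriller"],
})

# Series.str.get_dummies splits each string on the separator and
# returns one 0/1 indicator column per distinct token.
dummies = df["genres"].str.get_dummies(sep=" ")

# Keep the id column and attach the indicator columns next to it.
result = df[["id"]].join(dummies)
print(result)
```

Unlike plain `pd.get_dummies`, which would treat "drama romance" as one category, the string accessor version splits before encoding, so no `explode` or `groupby` step is needed.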

1 Answer


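Here is a possible solution (`df` is your DataFrame; the merge matches `id` against the row index, so it assumes the two are identical, as in your sample):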

pd.merge(df[['id']], pd.get_dummies(df.genres.str.split().explode()),
         left_on='id', right_index=True).groupby('id').sum()

Here is an example:

>>> import pandas as pd
>>> df = pd.DataFrame({'id': [0,1,2,3,4], 'genres': ['drama romance', 'drama', 'comedy', 'mystery thriller', 'crime thriller']})
>>> pd.merge(df[['id']], pd.get_dummies(df.genres.str.split().explode()), left_on='id', right_index=True).groupby('id').sum()
    comedy  crime  drama  mystery  romance  thriller
id                                                  
0        0      0      1        0        1         0
1        0      0      1        0        0         0
2        1      0      0        0        0         0
3        0      0      0        1        0         1
4        0      1      0        0        0         1
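
If the `id` values do not line up with the row index, one possible variant is to explode the genres while keeping `id` as a regular column and then cross-tabulate, along the lines of the `explode` and `crosstab` suggestion in the comments. The ids and the variable names `exploded` and `table` below are made up for illustration.

```python
import pandas as pd

# Hypothetical ids that deliberately differ from the default 0..n-1 index.
df = pd.DataFrame({
    "id": [10, 11, 12],
    "genres": ["drama romance", "drama", "crime thriller"],
})

# Turn each space-separated string into a list, then emit one row per genre.
# The id column is repeated alongside each genre, so no index matching is needed.
exploded = df.assign(genres=df["genres"].str.split()).explode("genres")

# Count how often each genre occurs for each id (0 or 1 for unique genres).
table = pd.crosstab(exploded["id"], exploded["genres"])
print(table)
```

Because `crosstab` counts occurrences, a genre listed twice for the same movie would show up as 2 rather than 1. Clipping the result with `table.clip(upper=1)` is one way to force strict 0/1 indicators.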
Riccardo Bucco