In my dataframe, I have a categorical variable that I'd like to convert into dummy variables. This column however has multiple values separated by commas:

0    'a'
1    'a,b,c'
2    'a,b,d'
3    'd'
4    'c,d'

Ultimately, I'd want to have binary columns for each possible discrete value; in other words, final column count equals number of unique values in the original column. I imagine I'd have to use split() to get each separate value but not sure what to do afterwards. Any hint much appreciated!

Edit: Additional twist. Column has null values. And in response to comment, the following is the desired output. Thanks!

   a  b  c  d
0  1  0  0  0
1  1  1  1  0
2  1  1  0  1
3  0  0  0  1
4  0  0  1  1
breakbotz

2 Answers

Use str.get_dummies

df['col'].str.get_dummies(sep=',')

    a   b   c   d
0   1   0   0   0
1   1   1   1   0
2   1   1   0   1
3   0   0   0   1
4   0   0   1   1
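
For the null values mentioned in the question's edit, here is a minimal self-contained sketch. The Series `s` and the extra missing row are made up for illustration; in the pandas versions I'm familiar with, a missing entry simply yields a row of zeros, so no special handling should be needed.

```python
import numpy as np
import pandas as pd

# Sample column from the question, plus one missing value (illustrative only).
s = pd.Series(['a', 'a,b,c', 'a,b,d', 'd', 'c,d', np.nan], name='col')

# One indicator column per distinct token, sorted alphabetically.
# The row holding the missing value gets 0 in every column.
dummies = s.str.get_dummies(sep=',')
print(dummies)
```

If you would rather drop or flag missing rows than encode them as all zeros, do that with dropna() or isna() before calling get_dummies.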

Edit: Updating the answer to address some questions.

Qn 1: Why is it that the Series method str.get_dummies does not accept the argument prefix=... while pandas.get_dummies() does accept it?

Series.str.get_dummies is a Series-level method (as the name suggests!). It one-hot encodes the values of a single Series (or DataFrame column), so there is no need for a prefix. pandas.get_dummies, on the other hand, can one-hot encode multiple columns at once, in which case the prefix parameter identifies the original column.

If you want to apply a prefix to the output of str.get_dummies, you can always use DataFrame.add_prefix:

df['col'].str.get_dummies(sep=',').add_prefix('col_')

Qn 2: If you have more than one column to begin with, how do you merge the dummies back into the original frame?

You can use pd.concat to join the one-hot encoded columns with the rest of the columns in the DataFrame.

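import pandas as pd
df = pd.DataFrame({'other': ['x', 'y', 'x', 'x', 'q'], 'col': ['a', 'a,b,c', 'a,b,d', 'd', 'c,d']})
# Append the dummy columns, then drop the original comma-separated column.
df = pd.concat([df, df['col'].str.get_dummies(sep=',')], axis=1).drop(columns='col')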

  other  a  b  c  d
0     x  1  0  0  0
1     y  1  1  1  0
2     x  1  1  0  1
3     x  0  0  0  1
4     q  0  0  1  1
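
As an alternative sketch (not part of the original answer), the same merge can be written with DataFrame.join and DataFrame.pop. pop removes the column from the frame in place and returns it, so no separate drop step is needed. The name `result` is chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'other': ['x', 'y', 'x', 'x', 'q'],
                   'col': ['a', 'a,b,c', 'a,b,d', 'd', 'c,d']})

# df.pop('col') runs first: it removes 'col' from df and returns it as a Series.
# The dummies are then joined back onto the remaining columns by index.
result = df.join(df.pop('col').str.get_dummies(sep=','))
print(result)
```

Note that pop mutates df, so use the concat version if you need to keep the original frame untouched.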
Vaishali
  • I feel stupid.... but this is exactly what I was trying to do. Thank you! – breakbotz Oct 21 '17 at 19:44
  • You shouldn't. Very few know all the functions that are available, rest of us are at different stages of learning :) All the best – Vaishali Oct 21 '17 at 19:47
  • This might be obvious, but if your data is separated by a comma and a space, make sure to include it! That is, `sep = ', '`. Otherwise, you end up with duplicate columns. – Huey Dec 19 '17 at 18:00
  • Why is it that the series method get_dummies does not accept the argument prefix=... while pandas.get_dummies() does accept it? – Amitai Jun 05 '18 at 10:18
  • This is what i'm also trying to do but due to large data and variability among it, it's giving memory error. Any method to get out of it? – Tarun Feb 13 '19 at 11:28
  • If you have more than one column to begin with, how do you merge the dummies back into the original frame? – Alex R Nov 22 '20 at 06:18
  • Great answer, thanks for including the concat portion at the end. Saved me a ton of time. – bogus Feb 11 '21 at 19:03
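
To illustrate the separator point raised in the comments, here is a small sketch with made-up data. The regex normalization at the end is an assumption on my part: it only makes sense if whitespace around the commas is never meaningful in your data.

```python
import pandas as pd

# Illustrative data: values separated by a comma followed by a space.
s = pd.Series(['a, b', 'b', 'a'])

# Splitting on ',' alone keeps the leading space, so ' b' and 'b'
# end up as two separate columns.
naive = s.str.get_dummies(sep=',')
print(list(naive.columns))

# Passing the full separator avoids the duplicate column.
exact = s.str.get_dummies(sep=', ')
print(list(exact.columns))

# If the spacing is inconsistent, strip whitespace around commas first.
normalized = s.str.replace(r'\s*,\s*', ',', regex=True).str.get_dummies(sep=',')
print(list(normalized.columns))
```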

The str.get_dummies method does not accept a prefix parameter, but you can rename the columns of the returned dummy DataFrame:

data['col'].str.get_dummies(sep=',').rename(lambda x: 'col_' + x, axis='columns')
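
A quick self-contained check, using the sample data from the question, suggests that this rename gives the same column labels as DataFrame.add_prefix. The variable names `renamed` and `prefixed` are just for illustration.

```python
import pandas as pd

data = pd.DataFrame({'col': ['a', 'a,b,c', 'a,b,d', 'd', 'c,d']})
dummies = data['col'].str.get_dummies(sep=',')

# Rename each column label with a function applied along the columns axis.
renamed = dummies.rename(lambda x: 'col_' + x, axis='columns')

# Equivalent shortcut when all you need is a fixed prefix.
prefixed = dummies.add_prefix('col_')

print(list(renamed.columns))
print(list(prefixed.columns))
```

rename is the more flexible of the two, since the function can do anything with the label (for example, upper-casing it or adding a suffix).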
micmia