Convert dataframe column string values into dummy variable columns

Question

I have the following dataframe (excluded rest of columns):

| customer_id | department                    |
| ----------- | ----------------------------- |
| 11          | ['nail', 'men_skincare']      |
| 23          | ['nail', 'fragrance']         |
| 25          | []                            |
| 45          | ['skincare', 'men_fragrance'] |

I am working on preprocessing my data to be fit into a model. I want to turn the department variable into dummy variables for each unique department category (for however many unique departments there could be, not just limited to what is here).

Want to get this result:

| customer_id | department                    | nail | men_skincare | fragrance | skincare | men_fragrance |
| ----------- | ----------                    | ---- | ------------ | --------- | -------- | ------------- |
| 11          | ['nail', 'men_skincare']      | 1    | 1            | 0         | 0        | 0             |
| 23          | ['nail', 'fragrance']         | 1    | 0            | 1         | 0        | 0             |
| 25          | []                            | 0    | 0            | 0         | 0        | 0             |
| 45          | ['skincare', 'men_fragrance'] | 0    | 0            | 0         | 1        | 1             |

I have tried this link, but when i splice it, it treats it as if its a string and only creates a column for each character in the string; what i used:

df['1st'] = df['department'].str[0]
df['2nd'] = df['department'].str[1]
df['3rd'] = df['department'].str[2]
df['4th'] = df['department'].str[3]
df['5th'] = df['department'].str[4]
df['6th'] = df['department'].str[5]
df['7th'] = df['department'].str[6]
df['8th'] = df['department'].str[7]
df['9th'] = df['department'].str[8]
df['10th'] = df['department'].str[9]

I then tried to split the strings and turn into a list using:

df['new_column'] = df['department'].apply(lambda x: x.split(","))

Then tried it again and still did the same thing of only creating columns for each character.

Any suggestions?

Edit: I found the answer using the link that anky sent over, specifically i used this one: https://stackoverflow.com/a/29036042

What worked for me:

df['department'] = df['department'].str.replace("'",'').str.replace("]",'').str.replace("[",'').str.replace(' ','')
df['department'] = df['department'].apply(lambda x: x.split(","))
s = df['department']
df1 = pd.get_dummies(s.apply(pd.Series).stack()).sum(level=0)
df = pd.merge(df, df1, right_index=True, left_index=True, how = 'left')

Welcome to Stack Overflow. Please read how to ask good [questions](https://stackoverflow.com/help/how-to-ask). Make sure your question covers these 3 elements: 1. Problem Statement 2. Your Code (it should be [Minimal, Reproducible Example](https://stackoverflow.com/help/minimal-reproducible-example) 3. Error Message (preferably full Traceback to help others review and provide feedback). Sometimes the same question may have already been asked. Make sure your question is not a [duplicate](https://stackoverflow.com/help/duplicates) — Joe Ferndz, Apr 25 '21 at 01:32
Does this help? https://stackoverflow.com/a/51420716/9840637 — anky, Apr 25 '21 at 02:25
@anky, yes that link was helpful, i specifically used this one; https://stackoverflow.com/a/29036042 — TealSeal, Apr 25 '21 at 03:59

Anurag Dabas · Answer 1 · 2021-04-25T02:30:16.150

3

import pandas as pd

You can do this by explode() ,value_counts() and fillna() method:

data=df.explode('department').fillna('empty')

Now use crosstab() method:

data=pd.crosstab(data['customer_id'],data['department'])

Since concat() method is giving you an error so use merge() method and drop() method:

data=pd.merge(df.set_index('customer_id'),data,left_index=True,right_index=True).drop(columns=['empty'])

Now if you print data you will get your desired output:

edited Apr 25 '21 at 02:30

answered Apr 25 '21 at 01:47

Anurag Dabas

23,866
9
21
41

Wow how did you know this so quickly! I got an error on the last code snipped: "object of type 'int' has no len()" – TealSeal Apr 25 '21 at 01:54
in `concat()` method? – Anurag Dabas Apr 25 '21 at 01:57
Yes, when running: data=pd.concat((df.set_index('customer_id'),data),axis=1).reset_index() – TealSeal Apr 25 '21 at 02:02
Updated answer with shorter code...Kindly check :) – Anurag Dabas Apr 25 '21 at 02:17
Now getting a different error, for the concat code :( ValueError: Shape of passed values is (179910, 5183), indices imply (35990, 5183) – TealSeal Apr 25 '21 at 02:19
I ran using the merge() method, it worked without errors but the result was not expected; it created a column for each unique combination of departments (ex. a column for "nails, mens_skincare", then a column for "nails" separately). So to me it seems like its just a string in the original df, so i stripped the [ ] and " and then turned the values of the department column into a list using: df['department'] = df['department'].apply(lambda x: x.split(",")); then I ran your suggestions again (the merge one) and it threw an error saying Index._join_level on non-unique index is not implemented – TealSeal Apr 25 '21 at 02:36
Not entirely, now a new problem showed up :( . I will try to tinker with it and fix it. – TealSeal Apr 25 '21 at 02:45
Instead of stripping `[]` and then splitting it by `df['department'].apply(lambda x: x.split(","))` use `df['department']=df['department'].apply(lambda x:eval(x))` and then try to repeat whole process – Anurag Dabas Apr 25 '21 at 02:52
hmm i ran it using eval(x) instead and then repeated the process, it keeps throwing an NotImplementedError: Index._join_level on non-unique index is not implemented, – TealSeal Apr 25 '21 at 02:58
try `data=pd.merge(df,data.reset_index(),how='left',on='customer_id').drop(columns=['empty'])` – Anurag Dabas Apr 25 '21 at 03:03
I used this and it worked!! https://stackoverflow.com/a/29036042 – TealSeal Apr 25 '21 at 03:56

score 0 · Answer 2 · answered Apr 25 '21 at 04:10

Try:

df.merge(pd.get_dummies(df.set_index('customer_id')
                          .explode('department'), 
                        prefix='', 
                        prefix_sep='').sum(level=0),
        left_on='customer_id', right_index=True)

Output:

   customer_id                 department  fragrance  men_fragrance  men_skincare  nail  skincare
0           11       [nail, men_skincare]          0              0             1     1         0
1           23          [nail, fragrance]          1              0             0     1         0
2           25                         []          0              0             0     0         0
3           45  [skincare, men_fragrance]          0              1             0     0         1

Convert dataframe column string values into dummy variable columns

2 Answers2