I have the following dataframe (other columns omitted):
| customer_id | department |
| ----------- | ----------------------------- |
| 11 | ['nail', 'men_skincare'] |
| 23 | ['nail', 'fragrance'] |
| 25 | [] |
| 45 | ['skincare', 'men_fragrance'] |
I am preprocessing my data so it can be fed into a model. I want to turn the department column into dummy variables, one for each unique department category (for however many unique departments there are, not just the ones shown here).
I want to get this result:
| customer_id | department | nail | men_skincare | fragrance | skincare | men_fragrance |
| ----------- | ---------- | ---- | ------------ | --------- | -------- | ------------- |
| 11 | ['nail', 'men_skincare'] | 1 | 1 | 0 | 0 | 0 |
| 23 | ['nail', 'fragrance'] | 1 | 0 | 1 | 0 | 0 |
| 25 | [] | 0 | 0 | 0 | 0 | 0 |
| 45 | ['skincare', 'men_fragrance'] | 0 | 0 | 0 | 1 | 1 |
I have tried this link, but when I slice the column, each value is treated as a string, so I only get a column for each character in the string. This is what I used:
df['1st'] = df['department'].str[0]
df['2nd'] = df['department'].str[1]
df['3rd'] = df['department'].str[2]
df['4th'] = df['department'].str[3]
df['5th'] = df['department'].str[4]
df['6th'] = df['department'].str[5]
df['7th'] = df['department'].str[6]
df['8th'] = df['department'].str[7]
df['9th'] = df['department'].str[8]
df['10th'] = df['department'].str[9]
I then tried to split the strings and turn them into lists using:
df['new_column'] = df['department'].apply(lambda x: x.split(","))
Then I tried the slicing again, and it still only created a column for each character.
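A possible explanation for the per-character columns, assuming the department values are stored as text that only looks like a list (for example after reading a CSV): indexing a string returns single characters, and splitting on commas leaves the brackets and quotes inside the pieces. The sketch below uses made-up data and the illustrative names `raw`, `pieces`, and `parsed`; `ast.literal_eval` is one way to turn such text into real lists.

```python
import ast
import pandas as pd

# Illustrative data: each cell is a string that merely looks like a list
raw = pd.Series(["['nail', 'men_skincare']", "[]"])

# Indexing a string returns a single character, not a list element
first_chars = raw.str[0].tolist()

# Splitting on commas keeps the brackets and quotes in the pieces
pieces = raw.iloc[0].split(",")

# ast.literal_eval parses the text into a real Python list
parsed = raw.apply(ast.literal_eval).tolist()

print(first_chars)
print(pieces)
print(parsed)
```

If the column already holds real lists, this step is unnecessary and the cause is something else.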
Any suggestions?
Edit: I found the answer using the link that anky sent over. Specifically, I used this one: https://stackoverflow.com/a/29036042
What worked for me:
df['department'] = df['department'].str.replace(r"[\[\]' ]", '', regex=True)
df['department'] = df['department'].apply(lambda x: x.split(","))
s = df['department']
# sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the equivalent
df1 = pd.get_dummies(s.apply(pd.Series).stack()).groupby(level=0).sum()
df = pd.merge(df, df1, right_index=True, left_index=True, how = 'left')
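For comparison, here is a minimal sketch of the same idea on the sample data, assuming the department column holds strings that look like lists. It is not the approach from the linked answer: it swaps in `ast.literal_eval` for the string cleaning and `Series.explode` for `apply(pd.Series).stack()`. The names `parsed`, `exploded`, `dummies`, and `result` are only illustrative.

```python
import ast
import pandas as pd

# Sample data from the question; department is assumed to be stored as text
df = pd.DataFrame({
    "customer_id": [11, 23, 25, 45],
    "department": [
        "['nail', 'men_skincare']",
        "['nail', 'fragrance']",
        "[]",
        "['skincare', 'men_fragrance']",
    ],
})

# Parse each string into a real Python list
parsed = df["department"].apply(ast.literal_eval)

# One row per (customer, department) pair; an empty list becomes NaN
exploded = parsed.explode()

# One indicator column per department, then collapse back to one row per customer
dummies = pd.get_dummies(exploded, dtype=int).groupby(level=0).sum()

# Attach the indicator columns to the original frame by index
result = df.join(dummies)
print(result)
```

The indicator columns come out in alphabetical order rather than in order of first appearance. Because `get_dummies` ignores NaN by default, the customer with an empty list gets a row of zeros, whereas splitting an empty string on commas yields `['']`, which, as far as I can tell, ends up as an extra column with an empty name.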