I have a source system that gives me data like this:
Name |Hobbies
----------------------------------
"Han" |"Art;Soccer;Writing"
"Leia" |"Art;Baking;Golf;Singing"
"Luke" |"Baking;Writing"
Each hobby list is semicolon delimited. I want to turn this into a table like structure with a column for each hobby and a flag to indicate if a person selected that hobby:
Name |Art |Baking |Golf |Singing |Soccer |Writing
--------------------------------------------------------------
"Han" |1 |0 |0 |0 |1 |1
"Leia" |1 |1 |1 |1 |0 |0
"Luke" |0 |1 |0 |0 |0 |1
Here's code to generate the sample data in a pandas dataframe:
>>> import pandas as pd
>>> df = pd.DataFrame(
... [
... {'name': 'Han', 'hobbies': 'Art;Soccer;Writing'},
... {'name': 'Leia', 'hobbies': 'Art;Baking;Golf;Singing'},
... {'name': 'Luke', 'hobbies': 'Baking;Writing'},
... ]
... )
>>> df
hobbies name
0 Art;Soccer;Writing Han
1 Art;Baking;Golf;Singing Leia
2 Baking;Writing Luke
Right now, I'm using the following code to get the data into a datatrame that has the structure I want, but it is really slow (my actual data set has about 1.5 million rows):
>>> df2 = pd.DataFrame(columns=['name', 'hobby'])
>>>
>>> for index, row in df.iterrows():
... for value in str(row['hobbies']).split(';'):
... d = {'name':row['name'], 'value':value}
... df2 = df2.append(d, ignore_index=True)
...
>>> df2 = df2.groupby('name')['value'].value_counts()
>>> df2 = df2.unstack(level=-1).fillna(0)
>>>
>>> df2
value Art Baking Golf Singing Soccer Writing
name
Han 1.0 0.0 0.0 0.0 1.0 1.0
Leia 1.0 1.0 1.0 1.0 0.0 0.0
Luke 0.0 1.0 0.0 0.0 0.0 1.0
Is there a more efficient way to do this?