I have a dataframe like the following:
In[8]: df = pd.DataFrame({'transport': ['Car;Bike;Horse','Car','Car;Bike', 'Horse;Car']})
df
Out[8]:
transport
0 Car;Bike;Horse
1 Car
2 Car;Bike
3 Horse;Car
And I want to convert it to something like this:
In[9]: df2 = pd.DataFrame({'transport_car': [True,True,True,True],'transport_bike': [True,False,True,False], 'transport_horse': [True,False,False,True]} )
df2
Out[10]:
transport_bike transport_car transport_horse
0 True True True
1 False True False
2 True True False
3 False True True
I got a solution, but it feels very 'hacked' and 'unpythonic'. (It works for my considerably small data set)
In[11]:
# get set of all possible values
new_columns = set()
for element in set(df.transport.unique()):
for transkey in str(element).split(';'):
new_columns.add(transkey)
print(new_columns)
# Use broadcast to initialize all columns with default value.
for col in new_columns:
df['trans_'+str(col).lower()] = False
# Change cells appropiate to keywords
for index, row in df.iterrows():
for key in new_columns:
if key in row.transport:
df.set_value(index, 'trans_'+str(key).lower(), True)
df
Out[11]:
transport trans_bike trans_car trans_horse
0 Car;Bike;Horse True True True
1 Car False True False
2 Car;Bike True True False
3 Horse;Car False True True
My goal is to use the second representation to perform some evaluation to answer questions like: "How often is car used?", "How often is car used together with horse", etc.
This and this answers suggest using pivot
and eval
might be the way to go, but I'm not sure.
So what would be the best way, to convert a DataFrame from first representation to the second?