2

I have a dataframe like the following:

In[8]: df = pd.DataFrame({'transport': ['Car;Bike;Horse','Car','Car;Bike', 'Horse;Car']})
df
Out[8]:
    transport
 0  Car;Bike;Horse
 1  Car
 2  Car;Bike
 3  Horse;Car

And I want to convert it to something like this:

In[9]: df2 = pd.DataFrame({'transport_car': [True,True,True,True],'transport_bike': [True,False,True,False], 'transport_horse': [True,False,False,True]} )
df2
Out[10]:
  transport_bike    transport_car   transport_horse
0   True                True            True
1   False               True            False
2   True                True            False
3   False               True            True

I got a solution, but it feels very 'hacked' and 'unpythonic'. (It works for my considerably small data set)

In[11]:
# get set of all possible values
new_columns = set()
for element in set(df.transport.unique()):
    for transkey in str(element).split(';'):
        new_columns.add(transkey)
print(new_columns)

# Use broadcast to initialize all columns with default value.
for col in new_columns:
    df['trans_'+str(col).lower()] = False

# Change cells appropiate to keywords
for index, row in df.iterrows():
    for key in new_columns:
        if key in row.transport:
            df.set_value(index, 'trans_'+str(key).lower(), True)
df

Out[11]:
    transport     trans_bike    trans_car   trans_horse
0   Car;Bike;Horse  True          True          True
1   Car             False         True          False
2   Car;Bike        True          True          False
3   Horse;Car       False         True          True

My goal is to use the second representation to perform some evaluation to answer questions like: "How often is car used?", "How often is car used together with horse", etc.

This and this answers suggest using pivot and eval might be the way to go, but I'm not sure.

So what would be the best way, to convert a DataFrame from first representation to the second?

Community
  • 1
  • 1
Arvodan
  • 1,205
  • 3
  • 13
  • 19

3 Answers3

3

You can use apply and construct a Series for each entry with the splited fields as index. This will result in a data frame with the index as the columns:

df.transport.apply(lambda x: pd.Series(True, x.split(";"))).fillna(False)

enter image description here

Psidom
  • 209,562
  • 33
  • 339
  • 356
  • 2
    Very nice answer. Might as well finish it off with `pd.concat([df, df.transport.apply(lambda x: pd.Series(True, ['trans_' + col for col in x.split(";")])).fillna(False)], axis=1)` – Ted Petrou Jan 04 '17 at 21:49
  • In the end, I used Psidorn's answer with the extension from Ted-Petrou. – Arvodan Jan 05 '17 at 14:43
2

I decided to extend the great @Metropolis's answer with a working example:

In [249]: %paste
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(min_df=1)
X = vectorizer.fit_transform(df.transport.str.replace(';',' '))

r = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())
## -- End pasted text --

In [250]: r
Out[250]:
   bike  car  horse
0     1    1      1
1     0    1      0
2     1    1      0
3     0    1      1

now you can join it back to the source DF:

In [251]: df.join(r)
Out[251]:
        transport  bike  car  horse
0  Car;Bike;Horse     1    1      1
1             Car     0    1      0
2        Car;Bike     1    1      0
3       Horse;Car     0    1      1

Timing: for 40K rows DF:

In [254]: df = pd.concat([df] * 10**4, ignore_index=True)

In [255]: df.shape
Out[255]: (40000, 1)

In [256]: %timeit df.transport.apply(lambda x: pd.Series(True, x.split(";"))).fillna(False)
1 loop, best of 3: 33.8 s per loop

In [257]: %%timeit
     ...: vectorizer = CountVectorizer(min_df=1)
     ...: X = vectorizer.fit_transform(df.transport.str.replace(';',' '))
     ...: r = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())
     ...:
1 loop, best of 3: 732 ms per loop
Community
  • 1
  • 1
MaxU - stand with Ukraine
  • 205,989
  • 36
  • 386
  • 419
1

I would consider using the Count Vectorizer provided by Scikit-learn. The vectorizer will construct a vector where each index refers to a term and the value refers to the number of appearances of that term in the record.

Advantages over the home-rolled approaches suggested in other answer are efficiency for large datasets and generalizability. Disadvantage is, obviously, bringing in an extra dependency.

Metropolis
  • 2,018
  • 1
  • 19
  • 36