
I have a source system that gives me data like this:

Name    |Hobbies
----------------------------------
"Han"   |"Art;Soccer;Writing"
"Leia"  |"Art;Baking;Golf;Singing"
"Luke"  |"Baking;Writing"

Each hobby list is semicolon-delimited. I want to turn this into a table-like structure with a column for each hobby and a flag to indicate whether a person selected that hobby:

Name    |Art     |Baking  |Golf    |Singing |Soccer  |Writing  
--------------------------------------------------------------
"Han"   |1       |0       |0       |0       |1       |1
"Leia"  |1       |1       |1       |1       |0       |0
"Luke"  |0       |1       |0       |0       |0       |1

Here's code to generate the sample data in a pandas dataframe:

>>> import pandas as pd
>>> df = pd.DataFrame(
...     [
...         {'name': 'Han',   'hobbies': 'Art;Soccer;Writing'},
...         {'name': 'Leia',  'hobbies': 'Art;Baking;Golf;Singing'},
...         {'name': 'Luke',  'hobbies': 'Baking;Writing'},
...     ]
... )
>>> df
                   hobbies  name
0       Art;Soccer;Writing   Han
1  Art;Baking;Golf;Singing  Leia
2           Baking;Writing  Luke

Right now, I'm using the following code to get the data into a dataframe that has the structure I want, but it is really slow (my actual data set has about 1.5 million rows):

>>> df2 = pd.DataFrame(columns=['name', 'value'])
>>>
>>> for index, row in df.iterrows():
...     for value in str(row['hobbies']).split(';'):
...         d = {'name':row['name'], 'value':value}
...         df2 = df2.append(d, ignore_index=True)
...
>>> df2 = df2.groupby('name')['value'].value_counts()
>>> df2 = df2.unstack(level=-1).fillna(0)
>>>
>>> df2
value  Art  Baking  Golf  Singing  Soccer  Writing
name
Han    1.0     0.0   0.0      0.0     1.0      1.0
Leia   1.0     1.0   1.0      1.0     0.0      0.0
Luke   0.0     1.0   0.0      0.0     0.0      1.0

Is there a more efficient way to do this?

Andy
  • Do you know the full set of possible hobbies? If not, it might be more effective to leave hobbies as a single column and then, for each hobby, have a row for that character (so Han would have three rows with Art, Soccer, and Writing as the single hobby per row). – RagingRoosevelt Dec 14 '17 at 16:58
  • Unfortunately, no. The source system allows values to be inserted (through code and data loads) that are not in the actual picklist dropdown. – Andy Dec 14 '17 at 17:27

3 Answers

3

Why not just change the DataFrame in place?

for idx, row in df.iterrows():
    for hobby in row.hobbies.split(";"):
        df.loc[idx, hobby] = True

df.fillna(False, inplace=True)
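For reference, here is a self-contained version of this approach run against the sample data from the question (the reassignment instead of `inplace=True` is a stylistic choice):

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame([
    {'name': 'Han',  'hobbies': 'Art;Soccer;Writing'},
    {'name': 'Leia', 'hobbies': 'Art;Baking;Golf;Singing'},
    {'name': 'Luke', 'hobbies': 'Baking;Writing'},
])

# Mark each hobby in place; assigning to a column that does not
# exist yet creates it, with NaN in every other row
for idx, row in df.iterrows():
    for hobby in row.hobbies.split(';'):
        df.loc[idx, hobby] = True

# Rows that never touched a hobby column are still NaN; flag them False
df = df.fillna(False)
```

The original `name` and `hobbies` columns are kept alongside the new flag columns.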
havanagrawal
  • Thank you! That improved performance by magnitudes. I tested on a 5000 row sample set, and your "in place" code was ten times faster. – Andy Dec 14 '17 at 18:37
2

Instead of appending to the DataFrame on every iteration, you could collect the rows in a list and append them all at once after the loop:

df3 = pd.DataFrame(columns=['name', 'value'])
d_list = []

for index, row in df.iterrows():
    for value in str(row['hobbies']).split(';'):
        d_list.append({'name':row['name'], 
                       'value':value})
df3 = df3.append(d_list, ignore_index=True)
df3 = df3.groupby('name')['value'].value_counts()
df3 = df3.unstack(level=-1).fillna(0)
df3

I checked how much time it would take for your example DataFrame. With the improvement I suggest, it's ~50 times faster.
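As a side note (not part of the original answer): once the rows are collected in a plain list, the empty starter frame and the `append` call are not strictly needed, since `pd.DataFrame` can build the frame from the list of dicts directly. This also sidesteps `DataFrame.append`, which was removed in pandas 2.0:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame([
    {'name': 'Han',  'hobbies': 'Art;Soccer;Writing'},
    {'name': 'Leia', 'hobbies': 'Art;Baking;Golf;Singing'},
    {'name': 'Luke', 'hobbies': 'Baking;Writing'},
])

d_list = []
for index, row in df.iterrows():
    for value in str(row['hobbies']).split(';'):
        d_list.append({'name': row['name'], 'value': value})

# One constructor call instead of an empty frame plus append
df3 = pd.DataFrame(d_list)
result = df3.groupby('name')['value'].value_counts().unstack(level=-1).fillna(0)
```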

Georgy
  • In my testing, it looks to be almost 70 times faster than my initial approach. Awesome! – Andy Dec 14 '17 at 18:56
  • On closer inspection, while it *is* a lot faster, it doesn't actually work: it doesn't create and add the columns. – Andy Dec 14 '17 at 19:05
  • It should work. I updated my answer with all the code that should produce the result except definition of `df`. Check again – Georgy Dec 14 '17 at 19:13
  • Sorry, I was at meetings and didn't get a chance to check for updates. Testing on 5,000, it takes about 500 ms. The code I had concocted took almost 30 seconds, so yours clocks in at about 60 times faster. My actual production dataset has 2,416,664 rows, so (based on 500 ms for 5,000) this *should* complete in around four minutes. I kicked it off wrapped in a timeit, but if it has to run three times to give me the best time, it'll be a little bit before it completes. I post another update in a while. – Andy Dec 14 '17 at 21:24
  • So, the best of three with the full dataset was 3min 37s (for the loops, not the append, groupby, and unstack), which is fantastic. My code was looking at clocking in at just under four hours! As I understand the code, the difference is only when and how the data for the new columns is added to the new dataset. By putting the data into a list of dictionaries, then appending that to the dataframe, you **massively** improved the performance! Thank you. :) – Andy Dec 14 '17 at 21:45
1

Actually, using .str.split and .melt should be slightly faster than looping with iterrows.

  1. Splitting into multiple columns:

    >>> df = pd.DataFrame([{'name': 'Han',   'hobbies': 'Art;Soccer;Writing'},
    ...                    {'name': 'Leia',  'hobbies': 'Art;Baking;Golf;Singing'},
    ...                    {'name': 'Luke',  'hobbies': 'Baking;Writing'}])
    >>> hobbies = df['hobbies'].str.split(';', expand=True)
    >>> hobbies
            0        1        2        3
    0     Art   Soccer  Writing     None
    1     Art   Baking     Golf  Singing
    2  Baking  Writing     None     None
    
  2. Unpivoting hobbies by names:

    >>> df = df.drop('hobbies', axis=1)
    >>> df = df.join(hobbies)
    >>> stacked = df.melt('name', value_name='hobby').drop('variable', axis=1)
    >>> stacked
        name    hobby
    0    Han      Art
    1   Leia      Art
    2   Luke   Baking
    3    Han   Soccer
    4   Leia   Baking
    5   Luke  Writing
    6    Han  Writing
    7   Leia     Golf
    8   Luke     None
    9    Han     None
    10  Leia  Singing
    11  Luke     None
    
  3. Counting the values:

    >>> counts = stacked.groupby('name')['hobby'].value_counts()
    >>> result = counts.unstack(level=-1).fillna(0).astype(int)
    >>> result
    hobby  Art  Baking  Golf  Singing  Soccer  Writing
    name
    Han      1       0     0        0       1        1
    Leia     1       1     1        1       0        0
    Luke     0       1     0        0       0        1
    

There are alternatives to steps 2 and 3, like using get_dummies or crosstab, as discussed here: Pandas get_dummies on multiple columns, but the first one will eat your memory, and the second one is much slower.
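For the sample data here, the get_dummies route collapses to a one-liner via `Series.str.get_dummies`, which splits on the separator and builds the 0/1 indicator columns in one step (with the memory caveat noted above for wide data):

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame([
    {'name': 'Han',  'hobbies': 'Art;Soccer;Writing'},
    {'name': 'Leia', 'hobbies': 'Art;Baking;Golf;Singing'},
    {'name': 'Luke', 'hobbies': 'Baking;Writing'},
])

# Split on ';' and pivot to indicator columns in a single step
result = df.set_index('name')['hobbies'].str.get_dummies(sep=';')
```

The resulting columns come out sorted alphabetically, matching the table in the question.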


References:
Pandas split column into multiple columns by comma
Pandas DataFrame stack multiple column values into single column

Georgy