
I have a pandas DataFrame with a column containing strings separated by a pipe character. These are movie genres. They looked like this:

Genre
Adventure|Animation|Children|Comedy|Fantasy
Comedy|Romance
...

I used str.split to turn each cell back into a list, like this:

Genre 
[Adventure, Animation, Children, Comedy, Fantasy]
[Adventure, Children, Fantasy]
[Comedy, Romance]
[Comedy, Drama, Romance]
[Comedy]
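For reference, a minimal sketch of that split step, rebuilding the sample data above (the last three strings are reconstructed from the split lists, since only the first two were shown):

```python
import pandas as pd

# pipe-separated genre strings from the question
df = pd.DataFrame({"Genre": [
    "Adventure|Animation|Children|Comedy|Fantasy",
    "Adventure|Children|Fantasy",
    "Comedy|Romance",
    "Comedy|Drama|Romance",
    "Comedy",
]})

# split each pipe-separated string into a list
df["Genre"] = df["Genre"].str.split("|")
print(df["Genre"].iloc[0])
# ['Adventure', 'Animation', 'Children', 'Comedy', 'Fantasy']
```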

I want to get a count of all the genres. For example, how many times did Comedy appear? How many times did Adventure? And so on. I can't seem to figure this out.

The result would look like this:

Comedy    4
Adventure 2
Animation 1
(...and so on...)
broepke

2 Answers


As somebody from the for-loop club, I recommend using Python's C-accelerated routines, itertools.chain and collections.Counter, for performance.

import pandas as pd
from itertools import chain
from collections import Counter

pd.Series(
    Counter(chain.from_iterable(x.split('|') for x in df.Genre)))

Adventure    1
Animation    1
Children     1
Comedy       2
Fantasy      1
Romance      1
dtype: int64

Why do I think CPython functions are better than pandas' "vectorised" string functions? String operations are inherently hard to vectorise, so the pandas str methods end up looping anyway. You can read more at For loops with pandas - When should I care?.


If you have to deal with NaNs, you can call a function that handles exceptions gracefully:

def try_split(x):
    try:
        return x.split('|')
    except AttributeError:
        return []

pd.Series(
    Counter(chain.from_iterable(try_split(x) for x in df.Genre)))
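For example, with a NaN in the column (a small sketch with made-up data), try_split swallows the AttributeError raised by the float NaN instead of crashing:

```python
import pandas as pd
import numpy as np
from itertools import chain
from collections import Counter

def try_split(x):
    try:
        return x.split('|')
    except AttributeError:
        # NaN is a float and has no .split; contribute nothing
        return []

df = pd.DataFrame({"Genre": ["Comedy|Romance", np.nan, "Comedy"]})
counts = pd.Series(
    Counter(chain.from_iterable(try_split(x) for x in df.Genre)))
print(counts)
# Comedy     2
# Romance    1
```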

Pandaically, you would do this with split, stack, and value_counts:

df['Genre'].str.split('|', expand=True).stack().value_counts()

Comedy       2
Romance      1
Children     1
Animation    1
Fantasy      1
Adventure    1
dtype: int64
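If you're on pandas 0.25 or newer, Series.explode offers a similar pandaic route without the wide intermediate frame (a sketch on the two-row sample):

```python
import pandas as pd

df = pd.DataFrame({"Genre": ["Adventure|Animation|Children|Comedy|Fantasy",
                             "Comedy|Romance"]})

# split into lists, explode to one genre per row, then count
counts = df["Genre"].str.split("|").explode().value_counts()
print(counts["Comedy"])
# 2
```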

The timing difference is obvious even for tiny DataFrames.

%timeit df['Genre'].str.get_dummies(sep='|').sum()
%timeit df['Genre'].str.split('|', expand=True).stack().value_counts()

%%timeit
pd.Series(
    Counter(chain.from_iterable(try_split(x) for x in df.Genre)))

2.8 ms ± 68.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.4 ms ± 210 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
320 µs ± 9.71 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
cs95
    Thank you for all the examples! I learned a ton with that! I ended up trying a few ways and you're very correct to point out the timing. I have only 27,000 records and it's noticeable. Thanks! – broepke Jan 20 '19 at 21:40

I'm also in favor of using chain + for.

Just to document this, one more possible way is to use get_dummies:

df['Genre'].str.get_dummies(sep='|').sum()
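To see why .sum() gives the counts, here is the intermediate 0/1 frame on the two-row sample (a sketch):

```python
import pandas as pd

df = pd.DataFrame({"Genre": ["Adventure|Animation|Children|Comedy|Fantasy",
                             "Comedy|Romance"]})

# one indicator column per genre; summing each column counts occurrences
dummies = df["Genre"].str.get_dummies(sep="|")
counts = dummies.sum()
print(counts["Comedy"])
# 2
```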
rafaelc