Pandas groupby and value counts for complex strings that have multiple occurrences

Question

Suppose I have a df like this:

    stringOfInterest        trend                  
0         C                    up
1         D                  down
2         E                  down
3         C,O                  up
4         C,P                  up

I want to plot this df as a bar graph using pandas. To obtain the proper grouped bar plots, I would like to group the data by the column df["trend"] and then count the occurrence of df["stringOfInterest"] for each letter. As can be seen, some of this strings contain multiple letters separated by a ",".

Using

df.groupby("trend").stringOfInterest.value_counts().unstack(0)

produces the expected result:

trend                  down    up
stringOfInterest                        
-                       7.0   8.0
C                       3.0  11.0
C,O                     NaN   2.0
C,P                     1.0   1.0
D                       1.0   2.0
E                      15.0  14.0
E,T                     1.0   NaN

However, I would like to count the occurrence of individual characters (C,E,D,...). On the original df this can be achieved like this:

s = df.stringOfInterest.str.split(",", expand = True).stack()
s.value_counts()

This typically generates something like this:

Unfortunately, this cannot be used here after the groupby() in combination with unstack().

Maybe I am on the wrong track and some more elegant way would be preferred.

To clarify the plotting: For each letter (stringOfInterest), there must be two bars indicating the number of "up" and "down" trend(s).

Anton vBR · Accepted Answer · 2018-04-30T10:35:24.537

Based on this answer here: Pandas expand rows from list data available in column

Is this something that would help you?

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame(
    {'stringOfInterest': {0: 'C', 1: 'D', 2: 'E', 3: 'C,O', 4: 'C,P'},
     'trend': {0: 'up', 1: 'down', 2: 'down', 3: 'up', 4: 'up'}})


df2 = (pd.DataFrame(df.stringOfInterest.str.split(',').tolist(), index=df.trend)
        .stack()
        .reset_index()
        .groupby('trend')[0]
        .value_counts()
        .unstack()
      ).T

df2.plot(kind='bar')
plt.show()

Another approach

We could also zip the columns together and expand.

import pandas as pd
from collections import Counter

data = [(x,i) for x,y in zip(df.trend,df.stringOfInterest.str.split(',')) for i in y]

pd.Series(Counter(data)).plot(kind='bar')

The second is much more readable! Thanks for both. Any estimate on the performance? — Fourier, Apr 30 '18 at 12:12
@Fourier The 2nd I just came up with. However, I haven't confirmed they produce the result you are asking for. — Anton vBR, Apr 30 '18 at 15:14

Pandas groupby and value counts for complex strings that have multiple occurrences

1 Answers1