2

my df looks like this:

category       text
--------       ----
soccer         soccer game is good
soccer         soccer game
basketball     game basketball
basketball     game
volleyball     sport volleyball sport   

What I want to do is groupby category and then list the words by its frequency

category       text          frequency
--------       ----          ---------
soccer         soccer        2
               game          2 
               is            1
               good          1
basketball     game          2
               basketball    1  
volleyball     sport         2
               volleyball    1

what did I do?

  • I group all the text together

df.groupby(['category])['text'].sum()

Now all the text are on the same rows since I grouped it but I do not know how to do a Frequency Table using each word count.

Could someone please help me?

BigBen
  • 46,229
  • 7
  • 24
  • 40
floss
  • 2,603
  • 2
  • 20
  • 37
  • You could use `split` and `explode()` to get the individual words into individual rows, then groupby with `count` – G. Anderson Jan 06 '21 at 18:07

2 Answers2

3

#Method 1:

You can use series.str.split with explode and the groupby.value_counts

(df.assign(text=df['text'].str.split()).explode("text")
 .groupby("category",sort=False)['text'].value_counts())

category    text      
soccer      game          2
            soccer        2
            good          1
            is            1
basketball  game          2
            basketball    1
volleyball  sport         2
            volleyball    1
Name: text, dtype: int64

#Method 2:

For older version of pandas using np.concatenate and index.repeat with df.join (There are other methods listed here)

s = df['text'].str.split()
(df[['category']].join(pd.Series(np.concatenate(s),
                      index=df.index.repeat(s.str.len()),name='text'))
.groupby("category",sort=False)['text'].value_counts())

#Method 3: using MultiLabelBinarizer from sklearn

from sklearn.preprocessing import MultiLabelBinarizer

s = df['text'].str.split()
mlb = MultiLabelBinarizer()
mlb.fit(s)
out = pd.DataFrame(mlb.transform(s),columns=mlb.classes_).groupby(df['category']).sum()
out.replace(0,np.nan).stack().astype(int)

category              
basketball  basketball    1
            game          2
soccer      game          2
            good          1
            is            1
            soccer        2
volleyball  sport         1
            volleyball    1
dtype: int32
anky
  • 74,114
  • 11
  • 41
  • 70
  • thanks for composing the answer. I am getting this error `'DataFrame' object has no attribute 'explode'` – floss Jan 06 '21 at 18:13
  • I am locked on the version number. possible to do it via `NLTK`? – floss Jan 06 '21 at 18:17
  • I get the following error with the updated code: `ValueError: zero-dimensional arrays cannot be concatenated` – floss Jan 06 '21 at 18:21
  • @floss i just tested with the df in the question and it worked. did you check with the example df you posted? – anky Jan 06 '21 at 18:22
  • @floss added another way. Please try. All of these are tested on the above df in the question and they work for me. – anky Jan 06 '21 at 18:42
0

value_counts is the right way. Usable inside a groupby too after split in words

Glauco
  • 1,385
  • 2
  • 10
  • 20