2

I have a dataframe in which one column contains non-unique lists of words, and another column contains numbers from 1 to 3. I want to count the occurrences of each word in the dataframe, grouped by the number in the second column.

Example:

words category
cat, dog, dog 1
cat, cat, mouse 1
mouse, cat, dog, elephant 2
elephant, elephant 3

Desired result:

word 1 2 3
cat 3 1 0
dog 2 1 0
mouse 1 1 0
elephant 0 1 2

I've found a few answers that are close to what I'm trying to do, like this one and some others that use value_counts, but none of them are quite right for this. Help?

TKR
  • 185
  • 1
  • 2
  • 11

2 Answers2

4

We can use DataFrame.explode + crosstab:

# If Not Already a List
# df['words'] = df['words'].str.split(', ')

new_df = df.explode('words')
new_df = pd.crosstab(
    new_df['words'], new_df['category']
).reset_index().rename_axis(columns=None)

Or with groupby size + unstack after explode:

new_df = (
    df.explode('words')  # Explode List into Rows
        .groupby(['words', 'category']).size()  # Calculate Group Sizes
        .unstack(fill_value=0)  # Convert Category values to column names 
        .reset_index().rename_axis(columns=None)  # Cleanup
)

or DataFrame.value_counts + unstack after explode:

new_df = (
    df.explode('words')  # Explode List into Rows
        .value_counts()  # Count Value Pairs
        .unstack(level='category',  # Convert Category values to column names
                 fill_value=0)
        .reset_index().rename_axis(columns=None)  # Cleanup
)

new_df:

      words  1  2  3
0       cat  3  1  0
1       dog  2  1  0
2  elephant  0  1  2
3     mouse  1  1  0

Setup:

import pandas as pd

df = pd.DataFrame({
    'words': [['cat', 'dog', 'dog'], ['cat', 'cat', 'mouse'],
              ['mouse', 'cat', 'dog', 'elephant'], ['elephant', 'elephant']],
    'category': [1, 1, 2, 3]
})
Henry Ecker
  • 34,399
  • 18
  • 41
  • 57
0

I feel for data structures such as this, you may have more performance if the data is wrangled outside Pandas, before returning into Pandas (of course this only matters if you care about performance, there is no need for unnecessary optimisation) - of course, tests are the only way to ensure this is True:

from collections import defaultdict
d = defaultdict(int)
for words, number in zip(df.words, df.category):
    for word in words:
        d[(word, number)] += 1


d

defaultdict(int,
            {('cat', 1): 3,
             ('dog', 1): 2,
             ('mouse', 1): 1,
             ('mouse', 2): 1,
             ('cat', 2): 1,
             ('dog', 2): 1,
             ('elephant', 2): 1,
             ('elephant', 3): 2})

Build the DataFrame:

 (pd.DataFrame(d.values(), index = d)
    .unstack(fill_value = 0)
    .droplevel(0, axis = 1)
  )

          1  2  3
cat       3  1  0
dog       2  1  0
elephant  0  1  2
mouse     1  1  0

Taking a cue from @HenryEcker, you could also use the Counter function:

 from itertools import product, chain
 from collections import Counter
# integers are put into a list as `product` works on iterables
pairing = (product(left, [right]) 
           for left, right 
           in zip(df.words, df.category))
 outcome = Counter(chain.from_iterable(pairing))
 outcome
Counter({('cat', 1): 3,
         ('dog', 1): 2,
         ('mouse', 1): 1,
         ('mouse', 2): 1,
         ('cat', 2): 1,
         ('dog', 2): 1,
         ('elephant', 2): 1,
         ('elephant', 3): 2})

Build the dataframe like before:

 (pd.DataFrame(outcome.values(), index = outcome)
    .unstack(fill_value = 0)
    .droplevel(0, axis = 1)
  )

          1  2  3
cat       3  1  0
dog       2  1  0
elephant  0  1  2
mouse     1  1  0
sammywemmy
  • 27,093
  • 4
  • 17
  • 31
  • 1
    Seems like `collections.Counter` might be more idiomatic than `defaultdict`. `Counter([(word, number) for words, number in zip(df.words, df.category) for word in words])` – Henry Ecker Sep 14 '21 at 01:31
  • `Counter` is more idiomatic, however, it does a lot of other things that make it a bit slower than instantiating int with `defaultdict` – sammywemmy Sep 14 '21 at 02:02
  • 1
    I'm pretty sure that has not been true since the Counter operations were moved to C in python 3. In any case, my current testing puts them at almost exactly the same for this sample (Counter being .07µs faster) – Henry Ecker Sep 14 '21 at 02:08
  • 1
    Found a thread [Python collections.Counter() runtime](https://stackoverflow.com/q/40513659/15497888) – Henry Ecker Sep 14 '21 at 02:13