
I am trying to count the frequency of elements in a column of a pandas DataFrame.

Some toy data:

import numpy as np
import pandas as pd

d = pd.DataFrame({'letters': [['a', 'b', 'c'], np.nan, ['a', 'e', 'd', 'c'], ['a', 'e', 'c']]})

What I can come up with is to loop through the rows and add values to a dictionary:

letter_count = {}
for i in range(len(d)):
    # skip missing rows
    if d['letters'].iloc[i] is np.nan:
        continue
    # tally each letter in the row's list
    for letter in d['letters'].iloc[i]:
        letter_count[letter] = letter_count.get(letter, 0) + 1

This worked for me, but it was not very fast since my dataset is big. I assume avoiding the explicit for-loop may help, but I cannot come up with a more 'pandasian' way to do this.

Any help is appreciated.

Xiaoyu Lu

2 Answers


Use itertools.chain.from_iterable to flatten the lists (after dropping NaNs), then Counter to count the elements, and wrap the result in a Series:

from itertools import chain
from collections import Counter

pd.Series(Counter(chain.from_iterable(d.letters.dropna())))

a    3
b    1
c    3
e    2
d    1
dtype: int64

Or, use value_counts for the counting step:

pd.Series(list(chain.from_iterable(d.letters.dropna()))).value_counts()

a    3
c    3
e    2
b    1
d    1
dtype: int64

Alternatively, np.unique with return_counts=True is also quite performant:

u, c = np.unique(list(chain.from_iterable(d.letters.dropna())), return_counts=True)

pd.Series(dict(zip(u, c)))

a    3
b    1
c    3
d    1
e    2
dtype: int64
cs95

Ok, unnesting again (a helper that explodes a list column into one row per element; a sketch of it follows below):

unnesting(d.dropna(), ['letters'])['letters'].value_counts()
a    3
c    3
e    2
d    1
b    1
Name: letters, dtype: int64
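
Note that unnesting is not part of pandas itself but a custom helper. A minimal sketch of such a helper, assuming each column listed in explode holds list-like values, might look like this:

import numpy as np
import pandas as pd

def unnesting(df, explode):
    # repeat each row's index once per element of its list column
    idx = df.index.repeat(df[explode[0]].str.len())
    # flatten each exploded column into one long column
    flat = pd.concat(
        [pd.DataFrame({col: np.concatenate(df[col].values)}) for col in explode],
        axis=1,
    )
    flat.index = idx
    # re-attach any columns that were not exploded
    return flat.join(df.drop(columns=explode), how='left')

On pandas 0.25 or newer, the built-in Series.explode gives the same counts without a helper: d['letters'].explode().value_counts()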
BENY