1

I have a somewhat large dataframe, formatted something like this:

colA  colB
1     c, d
2     d, e, f
3     e, d, a

I want to get a dictionary that counts instances of unique values in colB, like:

a: 1
c: 1
d: 3
e: 2
f: 1

My naive solution would be to iterate over every row of colB, split that, then use a Counter: my_counter[current_colB_object] += 1.

However, this answer strongly discourages iterating over dataframes, especially (like in my case) large ones.

What would be the preferred way of doing this?

Zubo
  • 1,543
  • 2
  • 20
  • 26

2 Answers2

1

Try with explode and value_counts:

>>> df["colB"].str.split(", ").explode().value_counts().to_dict()
{'d': 3, 'e': 2, 'c': 1, 'f': 1, 'a': 1}
Input df:
df = pd.DataFrame({"colA": [1, 2, 3],
                   "colB": ["c, d", "d, e, f", "e, d, a"]
                   })

>>> df
   colA     colB
0     1     c, d
1     2  d, e, f
2     3  e, d, a
not_speshal
  • 22,093
  • 2
  • 15
  • 30
1

Probably faster than the other answer -- you might want to time both yourself on a sample of your data; see df.sample().

from collections import Counter
cnt = Counter()
df.colB.str.split(', ').apply(cnt.update)
dict(cnt)

Outputs

{'c': 1, 'd': 3, 'e': 2, 'f': 1, 'a': 1}
tozCSS
  • 5,487
  • 2
  • 34
  • 31