First, note that you can use sum() to concatenate the lists, because + concatenates lists in Python:
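For reference, the snippets below assume a frame along these lines (the row split is hypothetical; only the concatenated order of the keywords matters):

```python
import pandas as pd

# Hypothetical example frame: a "keywords" column holding lists of strings.
# The split into rows is made up; it only needs to concatenate to the
# flattened list shown below.
df = pd.DataFrame({
    "keywords": [
        ["a", "b", "c", "c"],
        ["d", "a", "b", "c", "d"],
        ["b", "c", "g", "h", "i"],
    ]
})
```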
df.keywords.sum()
# out: ['a', 'b', 'c', 'c', 'd', 'a', 'b', 'c', 'd', 'b', 'c', 'g', 'h', 'i']
Then either:
import collections
collections.Counter(df.keywords.sum())
# out: Counter({'a': 2, 'b': 3, 'c': 4, 'd': 2, 'g': 1, 'h': 1, 'i': 1})
Or:
np.unique(df.keywords.sum(), return_counts=True)
# out: (array(['a', 'b', 'c', 'd', 'g', 'h', 'i'], dtype='<U1'), array([2, 3, 4, 2, 1, 1, 1]))
Or:
uniq = np.unique(df.keywords.sum(), return_counts=True)
pd.Series(uniq[1], uniq[0])
# out:
a 2
b 3
c 4
d 2
g 1
h 1
i 1
Or:
pd.Series(collections.Counter(df.keywords.sum()))
# out: same as previous
Performance-wise it's about the same whether you use np.unique() or collections.Counter, because df.keywords.sum() is actually not so fast: it concatenates the lists pairwise with +, building a new list at every step, which is quadratic in the total number of keywords. If you care about performance, a pure-Python list flattening is much faster:
collections.Counter([item for sublist in df.keywords for item in sublist])
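A rough way to see the gap (the frame here is synthetic and timings are machine-dependent, so treat the numbers as illustrative only):

```python
import timeit

import pandas as pd

# Synthetic frame: 1,000 rows of 5 keywords each.
df = pd.DataFrame({"keywords": [["a", "b", "c", "d", "e"]] * 1000})

# Quadratic pairwise list concatenation via Series.sum().
t_sum = timeit.timeit(lambda: df.keywords.sum(), number=10)

# Linear pure-Python flattening with a list comprehension.
t_flat = timeit.timeit(
    lambda: [item for sublist in df.keywords for item in sublist],
    number=10,
)

print(f"sum: {t_sum:.4f}s  flatten: {t_flat:.4f}s")
```

On any recent machine the flattening comes out far ahead, and the gap widens as the frame grows.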