#1 Delimited values
If your authors are clearly delineated, e.g. comma-separated in each series element, you can use collections.Counter with itertools.chain:
from collections import Counter
from itertools import chain
# .map(set) de-duplicates within each row, so each author counts once per row
res = Counter(chain.from_iterable(df['Authors'].str.split(',').map(set)))
# Counter({'Frank Herbert': 1, 'George Orwell': 2, 'John Steinbeck': 1,
# 'John Williams': 2, 'Philip K Dick': 1, 'Philip Roth': 1,
# 'Ursula K Le Guin': 1})
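The .map(set) step matters: it removes repeats within a row, so the result is the number of rows in which each author appears rather than the total number of mentions. A minimal sketch of the difference, assuming a plain Python list (the hypothetical rows) stands in for df['Authors'] with the same data as the Setup section:

```python
from collections import Counter
from itertools import chain

# Hypothetical stand-in for df['Authors'], same data as the Setup section.
rows = [
    'Ursula K Le Guin,Philip K Dick,Frank Herbert,Ursula K Le Guin',
    'John Williams,Philip Roth,John Williams,George Orwell',
    'George Orwell,John Steinbeck,George Orwell,John Williams',
]

# Count each author once per row (what .map(set) achieves above).
per_row = Counter(chain.from_iterable(set(r.split(',')) for r in rows))

# Count every mention, including repeats within a row.
total = Counter(chain.from_iterable(r.split(',') for r in rows))

print(per_row['John Williams'], total['John Williams'])  # 2 3
```

Drop the set conversion if you want total mentions instead of per-row presence.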
#2 Arbitrary strings
Of course, such structured data isn't always available. If your series elements are strings with arbitrary data and your list of pre-defined authors is small, you can use pd.Series.str.contains.
L = ['George Orwell', 'John Steinbeck', 'Frank Herbert', 'John Williams']
res = {i: df['Authors'].str.contains(i, regex=False).sum() for i in L}
# {'Frank Herbert': 1, 'George Orwell': 2, 'John Steinbeck': 1, 'John Williams': 2}
This works because pd.Series.str.contains returns a series of Boolean values, which you can then sum since True is treated as 1 in most numeric computations in Python / Pandas. We pass regex=False so that each name is matched as a literal string, which improves performance.
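As a quick illustration of why summing Booleans yields a count, here is a pure-Python sketch; the list flags is a hypothetical stand-in for the Boolean series that str.contains returns.

```python
# bool is a subclass of int, so True behaves as 1 and False as 0 in arithmetic.
flags = [True, False, True]  # hypothetical stand-in for a Boolean series

print(isinstance(True, int))  # True
print(True + True)            # 2
print(sum(flags))             # 2
```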
Performance
Pandas string-based methods are notoriously slow. You can instead use sum with a generator expression and the in operator for an extra speed-up:
df = pd.concat([df]*100000)
%timeit {i: df['Authors'].str.contains(i, regex=False).sum() for i in L} # 420 ms
%timeit {i: sum(i in x for x in df['Authors'].values) for i in L} # 235 ms
%timeit {i: df['Authors'].map(lambda x: i in x).sum() for i in L} # 424 ms
Notice, for scenario #1, Counter methods are actually more expensive because they require splitting as a preliminary step:
chainer = chain.from_iterable
%timeit Counter(chainer([set(i.split(',')) for i in df['Authors'].values])) # 650 ms
%timeit Counter(chainer(df['Authors'].str.split(',').map(set))) # 828 ms
Further improvements
- Solutions for scenario #2 are not perfect, since they won't differentiate (for example) between John Williams and John Williamson. You may wish to use a specialist package if this kind of differentiation is important to you.
- For both #1 and #2 you may wish to consider the Aho-Corasick algorithm. There is one example implementation, although more work may be required for a count of elements found within each row.
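One lightweight way to address the first point, short of a specialist package, is a regular expression with word boundaries. This is a sketch of a swapped-in technique, not something the approaches above do; the helper count_rows and the sample rows are hypothetical, and \b only helps when names are bounded by non-word characters such as commas or spaces.

```python
import re

def count_rows(name, rows):
    """Count the rows in which `name` appears as a whole word sequence."""
    # re.escape treats the name literally; \b stops 'John Williams'
    # from matching inside 'John Williamson'.
    pattern = re.compile(r'\b' + re.escape(name) + r'\b')
    return sum(bool(pattern.search(row)) for row in rows)

# Hypothetical sample data, not the Setup data.
rows = ['John Williams,Philip Roth', 'John Williamson,George Orwell']

print(count_rows('John Williams', rows))        # 1
print(sum('John Williams' in r for r in rows))  # 2 (plain substring test)
```

The same kind of pattern can be passed to pd.Series.str.contains with regex left on, at the cost of the regex=False speed-up mentioned earlier.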
Setup
import pandas as pd

df = pd.DataFrame({'Authors': ['Ursula K Le Guin,Philip K Dick,Frank Herbert,Ursula K Le Guin',
                               'John Williams,Philip Roth,John Williams,George Orwell',
                               'George Orwell,John Steinbeck,George Orwell,John Williams']})