
I have a Pandas DataFrame containing 75k rows of text (approx. 350 chars per row). I need to search for occurrences of a list of 45k substrings within that DataFrame.

The expected output is an authors_data dict containing the list of authors and the number of occurrences. The following code assumes I have a df['text'] column and a list of substrings named authors_list.

authors_data = {}
for author in authors_list:
    count = 0
    for i, row in df.iterrows():
        if author in row.text:
            count += 1
    authors_data[author] = count
    print(author, authors_data[author])

I ran some initial tests: 10 authors took about 50 seconds, so the complete list would take a few days to run. I'm therefore looking for more time-efficient ways to run this code.

Is df.iterrows() fast enough? Are there any specific libraries that I should look into?

Let me know!

sovnheim
  • Can you show us an example of your dataframe? A few rows should suffice, including rows with multiple authors. – jpp Aug 14 '18 at 14:13
  • IMO this is a prototypical opportunity for Cython conversion à la https://pandas.pydata.org/pandas-docs/stable/enhancingperf.html. Also, I'm pretty sure an iterrows() for loop is a bad-performing strategy and you're better off vectorizing with NumPy or using a list comprehension. But you should implement a few strategies and time them yourself. – 3pitt Aug 14 '18 at 14:22
  • A very helpful answer on the topic was removed for some reason. I'll post the answer that was given to me, but yes, list comprehension gives spectacularly better results. I'll read your tutorial; it sounds like the kind of thing I need. – sovnheim Aug 14 '18 at 14:29

4 Answers


#1 Delimited values

If your authors are clearly delimited, e.g. comma-separated in each series element, you can use collections.Counter with itertools.chain. Mapping set over the split lists dedupes repeated authors within a row, so each author is counted at most once per row:

from collections import Counter
from itertools import chain

res = Counter(chain.from_iterable(df['Authors'].str.split(',').map(set)))

# Counter({'Frank Herbert': 1, 'George Orwell': 2, 'John Steinbeck': 1,
#          'John Williams': 2, 'Philip K Dick': 1, 'Philip Roth': 1,
#          'Ursula K Le Guin': 1})

#2 Arbitrary strings

Of course, such structured data isn't always available. If your series elements are strings with arbitrary data and your list of pre-defined authors is small, you can use pd.Series.str.contains.

L = ['George Orwell', 'John Steinbeck', 'Frank Herbert', 'John Williams']

res = {i: df['Authors'].str.contains(i, regex=False).sum() for i in L}

# {'Frank Herbert': 1, 'George Orwell': 2, 'John Steinbeck': 1, 'John Williams': 2}

This works because pd.Series.str.contains returns a series of Boolean values, which you can then sum, since True is treated as 1 in most numeric computations in Python / Pandas. We turn off regex matching (regex=False) to improve performance.
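
As a trivial illustration of the Boolean sum:

import pandas as pd

# summing Booleans counts the True values: True == 1, False == 0
pd.Series([True, False, True]).sum()  # 2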

Performance

Pandas string-based methods are notoriously slow. You can instead use sum with a generator expression and the in operator for an extra speed-up:

df = pd.concat([df]*100000)

%timeit {i: df['Authors'].str.contains(i, regex=False).sum() for i in L}    # 420 ms
%timeit {i: sum(i in x for x in df['Authors'].values) for i in L}           # 235 ms
%timeit {i: df['Authors'].map(lambda x: i in x).sum() for i in L}           # 424 ms

Notice that, for scenario #1, the Counter methods are actually more expensive because they require splitting as a preliminary step:

chainer = chain.from_iterable

%timeit Counter(chainer([set(i.split(',')) for i in df['Authors'].values]))  # 650 ms
%timeit Counter(chainer(df['Authors'].str.split(',').map(set)))              # 828 ms

Further improvements

  1. Solutions for scenario #2 are not perfect, since they won't differentiate (for example) between John Williams and John Williamson. You may wish to use a specialist package if this kind of differentiation is important to you.
  2. For both #1 and #2 you may wish to consider the Aho-Corasick algorithm, which matches all the substrings in a single pass over each row. Example implementations exist, although more work may be required to get a count of elements found within each row; a sketch follows this list.
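
As a rough illustration, here is a minimal sketch of the Aho-Corasick approach using the third-party pyahocorasick package (the package choice is an assumption; any implementation of the algorithm would do), reusing L and df from this answer:

import ahocorasick
from collections import Counter

automaton = ahocorasick.Automaton()
for author in L:
    automaton.add_word(author, author)  # key -> payload returned on a match
automaton.make_automaton()              # build the failure links once

res = Counter()
for text in df['Authors']:
    # automaton.iter yields (end_index, payload) for every match found;
    # the set dedupes an author matched more than once within a row.
    # Like #2, this still matches 'John Williams' inside 'John Williamson'
    # unless you filter matches on word boundaries.
    res.update({payload for _, payload in automaton.iter(text)})

The automaton is built once, so each row is scanned a single time no matter how many authors you search for.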

Setup

df = pd.DataFrame({'Authors': ['Ursula K Le Guin,Philip K Dick,Frank Herbert,Ursula K Le Guin',
                               'John Williams,Philip Roth,John Williams,George Orwell',
                               'George Orwell,John Steinbeck,George Orwell,John Williams']})
jpp
  • My list of authors is rather clearly delineated so no worries here. I just iterate through a list of strings. And list comprehension is much faster than the methods I used previously. Thanks a lot. – sovnheim Aug 14 '18 at 14:47

I tried this and it does what you are looking for. You could test it and see whether it's faster.

for author in authors_list:
    authors_data[author] = df['AUTHORCOL'].map(lambda x: author in x).sum()

Not a complete answer, but there are a few things you can do to make things faster:

- Use regular expressions: create a pattern and then compile it (see e.g. "Find out how many times a regex matches in a string in Python"). In your case you can compile each author's pattern only once.

- You have two loops. Assuming a reasonable number of authors, put the smaller loop inside; that is, search for all authors in a row before moving to the next row. You would be surprised how important this can be at times: 350 characters fit in the CPU cache, and if you are lucky this can save you a lot of time.

Taking things to the limit (though probably not that easy): a compiled pattern is an automaton that looks at each character of the input only once and recognizes the output (that is why you "compile" patterns, https://en.wikipedia.org/wiki/Deterministic_finite_automaton). You could create all the automatons, then take each character of the input and feed it to all of them simultaneously. Then you would process each input character only "once" (times the non-constant number of authors).
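
A minimal sketch combining both ideas, assuming the question's authors_list and df['text'] (note that Python's re module uses backtracking rather than a true DFA, but a single compiled alternation still gives one scan per row):

import re
from collections import Counter

# one alternation of all escaped author names, longest first, so a longer
# name such as 'John Williamson' wins over its prefix 'John Williams'
names = sorted(authors_list, key=len, reverse=True)
pattern = re.compile('|'.join(map(re.escape, names)))

authors_data = Counter()
for text in df['text']:
    # one pass over the row covers every author at once;
    # set() counts each author at most once per row
    authors_data.update(set(pattern.findall(text)))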

ntg

A one-liner might be helpful.

authors_data = {author: df.text.map(lambda x: author in x).sum() for author in authors_list}
silgon