Fastest way to compare large strings in python

Question

I have a dictionary of words with their frequencies as follows.

mydictionary = {'yummy tim tam':3, 'fresh milk':2, 'chocolates':5, 'biscuit pudding':3}

I have a set of strings as follows.

recipes_book = "For today's lesson we will show you how to make biscuit pudding using yummy tim tam and fresh milk."

In the above string I have "biscuit pudding", "yummy tim tam" and "fresh milk" from the dictionary.

I am currently tokenizing the string to identify the words in the dictionary as follows.

words = recipes_book.split()
for word in words:
    if word in mydictionary:
        print("Match Found!")

However, this only works for single-word dictionary keys. I am therefore interested in the fastest way (my real recipes are very large texts) to identify dictionary keys that consist of more than one word. Please help me.

Maybe [re.findall](https://docs.python.org/2/library/re.html#re.findall) is what you are looking for. Or maybe some other function in regex library. — Rockybilly, Oct 03 '17 at 06:54

sberry · Accepted Answer · 2017-10-03T14:56:38.143

2

Build up your regex and compile it.

import re

mydictionary = {'yummy tim tam':3, 'fresh milk':2, 'chocolates':5, 'biscuit pudding':3}

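# Escape each key so it is matched literally, then join the keys into one alternation.
searcher = re.compile("|".join(map(re.escape, mydictionary.keys())), flags=re.I | re.S)

for match in searcher.findall(recipes_book):
    # re.I can match text in a different case than the key, so normalize before the lookup.
    mydictionary[match.lower()] += 1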

Output after this

{'yummy tim tam': 4, 'biscuit pudding': 4, 'chocolates': 5, 'fresh milk': 3}
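
To unpack what the `searcher` line does: `"|".join(...)` glues the keys into a single alternation pattern, and `re.compile` builds that pattern once so the whole text is scanned in one pass. The following is only a minimal sketch, assuming the keys are lowercase plain text. The longest-first sort and the `Counter` are additions for illustration and are not part of the answer above.

```python
import re
from collections import Counter

mydictionary = {'yummy tim tam': 3, 'fresh milk': 2, 'chocolates': 5, 'biscuit pudding': 3}
recipes_book = ("For today's lesson we will show you how to make biscuit pudding "
                "using yummy tim tam and fresh milk.")

# Escape each key so characters such as "." or "+" are matched literally, and
# try longer keys first so a key that is a prefix of another cannot shadow it.
keys = sorted(mydictionary, key=len, reverse=True)
pattern = "|".join(re.escape(key) for key in keys)

# Compile once; re.I makes the search case-insensitive.
searcher = re.compile(pattern, flags=re.I)

# findall returns the matched text; lower-case it because re.I may match "Fresh Milk".
found = Counter(match.lower() for match in searcher.findall(recipes_book))

# Add the number of occurrences of each key to its stored frequency.
for key, hits in found.items():
    mydictionary[key] += hits

print(mydictionary)
```

Because the alternation tries its branches from left to right, the order of the keys only matters when one key is a prefix of another, which is why the sketch sorts by length.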

edited Oct 03 '17 at 14:56

answered Oct 03 '17 at 07:19

sberry


Thank you for the great answer. can you please tell me what exactly happens in `searcher` line because it is not very clear to me. – Oct 03 '17 at 07:32
1

Can just use `'|'.join(mydictionary)` there - the `.format` and `.keys` seems unnecessary... – Jon Clements Oct 03 '17 at 07:38
@JonClements yep. Started with format because I envisioned using a named group but opted out and never simplified. – sberry Oct 03 '17 at 14:56

J_Zar · Answer 2 · 2017-10-03T07:26:43.270

1

According to some tests, the `in` operator is faster than the `re` module:

What's a faster operation, re.match/search or str.find?

Spaces in the keys are not a problem here. Assuming mydictionary is static (predefined), I think you should invert the loop and look for each key in the text:

# Python 2 only: dict.iterkeys() does not exist in Python 3, where you iterate the dict directly.
for key in mydictionary.iterkeys():
    if key in recipes_book:
        print("Match Found!")
        mydictionary[key] += 1

In Python 2, `iterkeys` returns an iterator instead of building a list of keys, which is good practice. In Python 3 you can iterate directly over the dict.
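
The loop above adds one per key no matter how often the key appears in the text. If occurrence counts matter, `str.count` returns the number of non-overlapping occurrences using the same plain substring search. This is a minimal sketch written for Python 3. The helper name `count_keys` is hypothetical. Substring search has no notion of word boundaries, so a key such as 'milk' would also be found inside 'milkshake'.

```python
def count_keys(text, keys):
    """Return {key: occurrences} for every key that appears in text.

    The search is case-sensitive, like the `in` operator.
    """
    return {key: text.count(key) for key in keys if key in text}


mydictionary = {'yummy tim tam': 3, 'fresh milk': 2, 'chocolates': 5, 'biscuit pudding': 3}
recipes_book = ("For today's lesson we will show you how to make biscuit pudding "
                "using yummy tim tam and fresh milk.")

hits = count_keys(recipes_book, mydictionary)

# Add the occurrences found in the text to the stored frequencies.
for key, n in hits.items():
    mydictionary[key] += n

print(hits)
```

This scans the text once per key, so it suits a small, fixed dictionary. With many keys, a single compiled pattern may scale better, but that is worth measuring on real data.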

edited Oct 03 '17 at 07:26

answered Oct 03 '17 at 07:03

J_Zar


1

Just a heads up, in python3, iterkeys method is not there for dict data types. – theBuzzyCoder Oct 03 '17 at 07:22

score 0 · Answer 3 · answered Oct 03 '17 at 07:01

0

Try it the other way around: search for each piece of text you want to find in the large chunk of string data.

import re
for item in mydictionary:
    # Escape the key so any regex metacharacters in it are matched literally.
    match = re.search(re.escape(item), recipes_book, flags=re.I | re.S)
    if match:
        start, end = match.span()
        print("Match found for %s between %d and %d character span" % (match.group(0), start, end))

answered Oct 03 '17 at 07:01

theBuzzyCoder


3

Why would you want to run multiple regex's though when you can search it for any of the keys at once? – Jon Clements Oct 03 '17 at 07:14
Yes, you are right!!! But what if you want to find all the patterns that match. "|" will give only one match at a time. – theBuzzyCoder Oct 03 '17 at 07:23
Well... `re.findall('|'.join(mydictionary), recipes_book)` seems to work for me... So if one had `recipes_book` as a `Counter`, you could then just do an `.update` on it... – Jon Clements Oct 03 '17 at 07:27
Cool. You can get the same output through multiple solutions. Solutions depend on use cases. In my case, I also wanted to find where exactly the match is found (from which character to which character). re.findall doesn't give that. It just gives the list of all matches. So basically, you cannot say that one solution is better than another without knowing the use case. If re.findall works for you, it may be what others are looking for... But I agree that re.findall is also a good solution if you only want to find the matches – theBuzzyCoder Oct 03 '17 at 07:35
Sure - I was just going by what the person who asked the question was after... seems they were after occurrences rather than positions... – Jon Clements Oct 03 '17 at 07:39
Anyway - in your case - consider looping over the result of `re.finditer` - that way you can retrieve the positions of things separated by `|` without running `re.search` multiple times for each thing you're looking for... :) – Jon Clements Oct 03 '17 at 07:40
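
The `re.finditer` suggestion in the comments can be sketched as follows: one compiled alternation, one pass over the text, and a match object (with its span) for every hit. This is a minimal sketch. The `positions` list and the `re.escape` call are additions for illustration.

```python
import re

mydictionary = {'yummy tim tam': 3, 'fresh milk': 2, 'chocolates': 5, 'biscuit pudding': 3}
recipes_book = ("For today's lesson we will show you how to make biscuit pudding "
                "using yummy tim tam and fresh milk.")

# One pattern for all keys; escape them so they are matched literally.
searcher = re.compile("|".join(re.escape(key) for key in mydictionary), flags=re.I)

positions = []
# finditer yields match objects in the order the matches occur in the text.
for match in searcher.finditer(recipes_book):
    start, end = match.span()
    positions.append((match.group(0), start, end))
    print("Match found for %s between %d and %d character span" % (match.group(0), start, end))
```

Each tuple in `positions` holds the matched text with its start and end offsets, so `recipes_book[start:end]` gives back the match.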
