Optimize runtime of calling 'in' many times on same string

Question

In finding substrings, the in operator is the best in terms of performance for a single call. It looks like it is also O(n) average runtime.

If i want to find whether or not several substrings exist in a string, something like this:

if 'dog' in str:
    # ....
if 'cat' in str:
    # ....
if 'frog' in str:
    # ....

This would be 3n runtime, which is a lot of repeated work. Is there a way to optimize in or another library that is available that would be a faster alternative?

Are you doing the same thing if any of these are in the string? — StephenTG, Feb 24 '16 at 20:26
Duplicate of http://stackoverflow.com/questions/7128153/multiple-in-operators-in-python http://stackoverflow.com/questions/2308944/multiple-value-checks-using-in-operator-python — fukanchik, Feb 24 '16 at 20:26
This is hard to answer in the general case - what exactly are you trying to achieve? `O(3n)` is really just `O(n)`, and it's likely that any other attempt would be slower for smaller `n`. Have you had an actual performance problem and profiled it to this bottleneck? — jonrsharpe, Feb 24 '16 at 20:26
@StephenTG ideally, I can insert different actions for different matches — user4998087, Feb 24 '16 at 20:26
@jonrsharpe sure, it is still O(n), but there is a lot of repeated work being done — user4998087, Feb 24 '16 at 20:28
@Will have you ever heard the saying about having one problem and now you have 2 problems? — Joran Beasley, Feb 24 '16 at 20:29
@JoranBeasley when we run out problems we run out of things to do — Will, Feb 24 '16 at 20:30
@user4998087 that doesn't address my question! For example, if you have a long string that doesn't change and want to quickly check whether it contains a given substring, you might approach the task completely differently from if you need to count the number of appearances of a substring in a string. — jonrsharpe, Feb 24 '16 at 20:42
@jonrsharpe the objective is only to check whether or not strings are substrings. If str is a huge string, then doing O(n) for a non-trivial number of substring operations is pretty expensive. The question is if there is a more efficient way to do exactly this — user4998087, Feb 24 '16 at 20:50
You could use a trie, then once you've processed the long string finding substrings is very fast - this will only be worth it under certain circumstances, and potentially takes a lot of space. — jonrsharpe, Feb 24 '16 at 21:00

Joran Beasley · Accepted Answer · 2016-02-25T04:54:01.527

#EDIT

==============================================================

a_list  = re.sub("[^a-zA-Z ]","",s).split()#4957 words (lorum ipsum generated)
search_space = set("dog cat fish bear walrus".split())

def joranbeasley():
    return search_space.intersection(a_list)

def stephenPochmann():
    for needle in search_space:
        if needle in s: print needle


import timeit
print "Stephen Timeit:",timeit.timeit(stephenPochmann,number=1000)
print "joran Timeit:",timeit.timeit(joranbeasley,number=1000)

results

Stephen Timeit: 0.126952238343
joran Timeit: 0.148540107751

===============================================================

set(["dog","cat","frog"]).intersection(my_str.split())

might give you what you need its hard to tell and should be plenty fast ...

if your string uses delimiters other than spaces you might need to pass an argument to split with your delimiter(";" or something)

you also might have to clean your input to remove stuff like punctuation

my_cleaned_string = re.sub("[^a-zA-Z]","",my_str)

compared to @StephenPochmans if I change it a bit (ie I dont need to keep splitting every time)

import re
a_list  = re.sub("[^a-zA-Z ]","",s).split()#4957 words (lorum ipsum generated)
search_space = set("dog cat fish bear walrus".split())
def stephenPochmann():
    for needle in search_space:
        if needle in a_list: print needle

def joranbeasley():
    return search_space.intersection(a_list)

import timeit
print "Stephen Timeit:",timeit.timeit(stephenPochmann,number=1000)
print "joran Timeit:",timeit.timeit(joranbeasley,number=1000)

and the results

c:\py_exp>python test_benchmark.py
Stephen Timeit: 0.356363602542
joran Timeit: 0.166205366392

after changeing @StephenPochmans to use the string instead of the list, he is right and it is indeed faster ... I will clarify this at the top of my answer soon

def stephenPochmann():
    for needle in search_space:
        if needle in s: print needle

here is the results

Stephen Timeit: 0.126952238343
joran Timeit: 0.148540107751

Assuming otherwise-delimited words (with regular delimiters) this is still a good answer, but `split` needs an argument. — Two-Bit Alchemist, Feb 24 '16 at 20:29
@Two-BitAlchemist: Why would you use an argument with `split` here? — Matthias, Feb 24 '16 at 20:30
The intersection operator is O(n) average and O(n^2) worst case....[Time Complexity](https://wiki.python.org/moin/TimeComplexity#set) — Felix Castor, Feb 24 '16 at 20:30
@Matthias I had to reread it ... he is saying if your delimiter is something other than empty space — Joran Beasley, Feb 24 '16 at 20:31
Err, yeah, sorry for the confusion. If your words are delimited by `--` or something instead of whitespace, you have to tell `split` that — Two-Bit Alchemist, Feb 24 '16 at 20:32
@FelixCastor yes but its only doing it once as oposed to calling in for however many conditions — Joran Beasley, Feb 24 '16 at 20:32
As I suspected, this appears to actually be a lot **slower**. Please see my answer and let me know what you think. — Stefan Pochmann, Feb 24 '16 at 20:53

score 0 · Answer 2 · answered Feb 24 '16 at 20:38

If you have a lot of words you should consider moving to an algorithm build for searching multiple words inside a text.

The most popular is Aho–Corasick algorithm (https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm) and you can find many implementation for it in Python. This is a just one of many tutorial explaining about the algorithm with implementing it: http://carshen.github.io/data-structures/algorithms/2014/04/07/aho-corasick-implementation-in-python.html

Optimize runtime of calling 'in' many times on same string

2 Answers2

#EDIT