3

In finding substrings, the in operator is the best in terms of performance for a single call. It looks like it is also O(n) average runtime.

If i want to find whether or not several substrings exist in a string, something like this:

if 'dog' in str:
    # ....
if 'cat' in str:
    # ....
if 'frog' in str:
    # ....

This would be 3n runtime, which is a lot of repeated work. Is there a way to optimize in or another library that is available that would be a faster alternative?

Community
  • 1
  • 1
user4998087
  • 317
  • 1
  • 10
  • Are you doing the same thing if any of these are in the string? – StephenTG Feb 24 '16 at 20:26
  • Duplicate of http://stackoverflow.com/questions/7128153/multiple-in-operators-in-python http://stackoverflow.com/questions/2308944/multiple-value-checks-using-in-operator-python – fukanchik Feb 24 '16 at 20:26
  • 2
    This is hard to answer in the general case - what exactly are you trying to achieve? `O(3n)` is really just `O(n)`, and it's likely that any other attempt would be slower for smaller `n`. Have you had an actual performance problem and profiled it to this bottleneck? – jonrsharpe Feb 24 '16 at 20:26
  • @StephenTG ideally, I can insert different actions for different matches – user4998087 Feb 24 '16 at 20:26
  • a regular expression might help? – Will Feb 24 '16 at 20:27
  • @ap no order does not matter – user4998087 Feb 24 '16 at 20:27
  • @jonrsharpe sure, it is still O(n), but there is a lot of repeated work being done – user4998087 Feb 24 '16 at 20:28
  • @Will have you ever heard the saying about having one problem and now you have 2 problems? – Joran Beasley Feb 24 '16 at 20:29
  • @JoranBeasley when we run out problems we run out of things to do – Will Feb 24 '16 at 20:30
  • 1
    @user4998087 that doesn't address my question! For example, if you have a long string that doesn't change and want to quickly check whether it contains a given substring, you might approach the task completely differently from if you need to count the number of appearances of a substring in a string. – jonrsharpe Feb 24 '16 at 20:42
  • @jonrsharpe the objective is only to check whether or not strings are substrings. If str is a huge string, then doing O(n) for a non-trivial number of substring operations is pretty expensive. The question is if there is a more efficient way to do exactly this – user4998087 Feb 24 '16 at 20:50
  • You could use a trie, then once you've processed the long string finding substrings is very fast - this will only be worth it under certain circumstances, and potentially takes a lot of space. – jonrsharpe Feb 24 '16 at 21:00

2 Answers2

5

#EDIT

==============================================================

a_list  = re.sub("[^a-zA-Z ]","",s).split()#4957 words (lorum ipsum generated)
search_space = set("dog cat fish bear walrus".split())

def joranbeasley():
    return search_space.intersection(a_list)

def stephenPochmann():
    for needle in search_space:
        if needle in s: print needle


import timeit
print "Stephen Timeit:",timeit.timeit(stephenPochmann,number=1000)
print "joran Timeit:",timeit.timeit(joranbeasley,number=1000)

results

Stephen Timeit: 0.126952238343
joran Timeit: 0.148540107751

===============================================================

set(["dog","cat","frog"]).intersection(my_str.split())

might give you what you need its hard to tell and should be plenty fast ...

if your string uses delimiters other than spaces you might need to pass an argument to split with your delimiter(";" or something)

you also might have to clean your input to remove stuff like punctuation

my_cleaned_string = re.sub("[^a-zA-Z]","",my_str)

compared to @StephenPochmans if I change it a bit (ie I dont need to keep splitting every time)

import re
a_list  = re.sub("[^a-zA-Z ]","",s).split()#4957 words (lorum ipsum generated)
search_space = set("dog cat fish bear walrus".split())
def stephenPochmann():
    for needle in search_space:
        if needle in a_list: print needle

def joranbeasley():
    return search_space.intersection(a_list)

import timeit
print "Stephen Timeit:",timeit.timeit(stephenPochmann,number=1000)
print "joran Timeit:",timeit.timeit(joranbeasley,number=1000)

and the results

c:\py_exp>python test_benchmark.py
Stephen Timeit: 0.356363602542
joran Timeit: 0.166205366392

after changeing @StephenPochmans to use the string instead of the list, he is right and it is indeed faster ... I will clarify this at the top of my answer soon

def stephenPochmann():
    for needle in search_space:
        if needle in s: print needle

here is the results

Stephen Timeit: 0.126952238343
joran Timeit: 0.148540107751
Joran Beasley
  • 110,522
  • 12
  • 160
  • 179
0

If you have a lot of words you should consider moving to an algorithm build for searching multiple words inside a text.

The most popular is Aho–Corasick algorithm (https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm) and you can find many implementation for it in Python. This is a just one of many tutorial explaining about the algorithm with implementing it: http://carshen.github.io/data-structures/algorithms/2014/04/07/aho-corasick-implementation-in-python.html

inet123
  • 764
  • 6
  • 10