
When determining whether a substring occurs in a larger string, I am considering two options:

(1)

if "aaaa" in "bbbaaaaaabbb":
    dosomething()

(2)

import re

pattern = re.compile("aaaa")
if pattern.search("bbbaaaaaabbb"):
    dosomething()

Which of the two is more efficient and faster, given that the string is huge?

Is there any other option that is faster?

Thanks

user2492270
  • The `in` operator will be the fastest: http://stackoverflow.com/questions/4901523/whats-a-faster-operation-re-match-search-or-str-find – bitoffdev Nov 11 '13 at 16:48
  • regular expressions are much more complicated than simply checking for membership ... in any language ... – Joran Beasley Nov 11 '13 at 17:03

3 Answers


Regex will be slower.

$ python -m timeit '"aaaa" in "bbbaaaaaabbb"'
10000000 loops, best of 3: 0.0767 usec per loop
$ python -m timeit -s 'import re; pattern = re.compile("aaaa")' 'pattern.search("bbbaaaaaabbb")'
1000000 loops, best of 3: 0.356 usec per loop
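
The same comparison can also be run from inside a script with the `timeit` module. The following is only a minimal sketch: the helper names `with_in`, `with_find` and `with_regex` are made up for illustration, and `str.find` is included as one more candidate to measure, not as a claim about which is fastest. Because each measurement goes through a function call, the absolute numbers will not match the command-line figures above.

```python
import re
import timeit

haystack = "bbbaaaaaabbb"
needle = "aaaa"
pattern = re.compile(needle)  # compiled once, outside the timed code


def with_in():
    # Option (1): plain membership test
    return needle in haystack


def with_find():
    # str.find returns -1 when the substring is absent
    return haystack.find(needle) != -1


def with_regex():
    # Option (2): search with the precompiled pattern
    return pattern.search(haystack) is not None


results = {}
for func in (with_in, with_find, with_regex):
    # timeit.timeit returns the total time in seconds for `number` calls
    results[func.__name__] = timeit.timeit(func, number=100_000)

for name, seconds in results.items():
    print(f"{name}: {seconds:.4f} s for 100,000 calls")
```

Timings vary between machines and Python versions, so it is worth running this on the actual data before drawing conclusions.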
ASGM
John La Rooy

Option (1) is definitely faster. In the future, you can test it yourself with something like this:

>>> import time, re
>>> if True:
...     s = time.time()
...     "aaaa" in "bbbaaaaaabbb"
...     print time.time()-s
... 
True
1.78813934326e-05

>>> if True:
...     s = time.time()
...     pattern = re.compile("aaaa")
...     pattern.search("bbbaaaaaabbb")
...     print time.time()-s
... 
<_sre.SRE_Match object at 0xb74a91e0>
0.0143280029297

gnibbler's way of doing this is better; I had never really played around with interpreter options, so I didn't know about that one.

jazzpi
  • Why the `if True`? Why time a single iteration with `time.time` and not using `timeit`? –  Nov 11 '13 at 16:56
  • @delnan I didn't know about `timeit`, so I used an `if` to get everything executed at once, since I was too lazy to create files... – jazzpi Nov 11 '13 at 16:58

I happen to have the E. coli genome at hand, so I tested the two options. Searching for "AAAA" in the E. coli genome 10,000,000 times (just to get measurable times) takes about 3.7 seconds with option (1). With option (2), with pattern = re.compile("AAAA") kept outside the loop, it took about 8.4 seconds. In my case, "dosomething()" added 1 to an arbitrary variable. The E. coli genome I used is 4,639,675 nucleotides (letters) long.
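
For anyone without a genome file at hand, a rough sketch of a similar test on a large string might look like the following. Everything here is an assumption made for illustration: the sequence is random letters rather than real E. coli data, the marker "NNNN" is a made-up needle placed at the very end so that every search has to scan the whole string, and the helper names `count_with_in` and `count_with_regex` are hypothetical. The repeat count is kept small so that it finishes quickly.

```python
import random
import re
import timeit

random.seed(0)  # fixed seed so the synthetic sequence is reproducible

# Stand-in for a genome: 1,000,000 random nucleotides (not real data)
genome = "".join(random.choices("ACGT", k=1_000_000))
needle = "NNNN"   # hypothetical marker that cannot occur in the random part
genome += needle  # the only match sits at the very end of the string

pattern = re.compile(needle)  # compiled once, outside the loop


def count_with_in(repeats):
    hits = 0
    for _ in range(repeats):
        if needle in genome:
            hits += 1  # plays the role of dosomething()
    return hits


def count_with_regex(repeats):
    hits = 0
    for _ in range(repeats):
        if pattern.search(genome):
            hits += 1  # plays the role of dosomething()
    return hits


repeats = 20
t_in = timeit.timeit(lambda: count_with_in(repeats), number=1)
t_re = timeit.timeit(lambda: count_with_regex(repeats), number=1)
print(f"in:    {t_in:.4f} s for {repeats} searches")
print(f"regex: {t_re:.4f} s for {repeats} searches")
```

Where the first match sits in the string strongly affects the timings, so results from this sketch are not directly comparable to the numbers above.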

Roberto