
I am processing a 500 MB file, and the processing time increased when I used re.search.

Below are the cases I have tested. In all of them I am reading the file line by line and using only one if condition.

Case 1:

prnt = re.compile(r"(?i)<spanlevel level='7'>")
if prnt.search(line):
    print "Matched"
    out_file.write(line)
else:
    out_file.write(line)

This took 16 seconds to process the entire file.

Case 2:

if re.search(r"(?i)<spanlevel level='7'>", line):
    print "Matched"
    out_file.write(line)
else:
    out_file.write(line)

This took 25 seconds to process the file.

Case 3:

if "<spanlevel level='7'>" in line:
    print "Matched"
    out_file.write(line)
else:
    out_file.write(line)

This took only 8 seconds to process the file.

Can anyone please explain the difference between the three cases? Case 3 is the fastest, but I am unable to do a case-insensitive match with it. How can I do a case-insensitive match in Case 3?

Fla-Hyd
    Generally, when you are not in need of regex, don't use it. Regex is usually more expensive than character scanning or indexOf – nhahtdh Mar 18 '13 at 16:27
  • You are using Python 2 syntax but tagged this with `python-3.x`. What is it, python 2 or 3? – Martijn Pieters Mar 18 '13 at 16:34
  • @MartijnPieters I am using python 2.7 – Fla-Hyd Mar 18 '13 at 17:14
  • The _fastest_ way to do this would probably be to clone the [`fastsearch`](http://hg.python.org/cpython/file/2.7/Objects/stringlib/fastsearch.h) C implementation behind `str.find`, modify it to be case-insensitive, build an `ifind` extension module around it, then run that against an `mmap` of your file. But that's a lot of work. Unless someone's already done it and posted it on PyPI, or this is really important to your work, either your "case 1" or Martijn Pieters' answer is probably good enough, right? – abarnert Mar 18 '13 at 17:31

1 Answer


A case-insensitive search for case 3 first:

if "<spanlevel level='7'>" in line.lower():

By lowercasing `line`, you make the search case-insensitive; the text you are looking for is already all lowercase, so only the line itself needs lowercasing.
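For context, a minimal sketch of the full loop with that change applied (assuming `in_file` and `out_file` are already-open file objects, as in your snippets; Python 2 syntax since you are on 2.7):

needle = "<spanlevel level='7'>"
for line in in_file:
    if needle in line.lower():   # lowercase the line; the needle is already lowercase
        print "Matched"
    out_file.write(line)         # the line is written either way, as in your cases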

As for why case 2 is so much slower: using a pre-compiled regular expression is faster because you avoid the cache lookup for the regular expression pattern on each and every line you read from the file. Under the hood, `re.search()` calls `re.compile()` whenever no cached copy exists yet, and that extra function call and cache check cost you time on every call.
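A rough way to see that overhead for yourself (a hypothetical micro-benchmark; the exact numbers are machine-dependent, but the pre-compiled pattern should come out ahead):

import re
import timeit

line = "some text <SpanLevel Level='7'> more text"

# pattern object built once; search() reuses it directly
prnt = re.compile(r"(?i)<spanlevel level='7'>")
print timeit.timeit(lambda: prnt.search(line), number=100000)

# module-level re.search() goes through the pattern cache on every call
print timeit.timeit(lambda: re.search(r"(?i)<spanlevel level='7'>", line), number=100000)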

That is doubly painful on Python 3.3, which switched to a new caching model using the functools.lru_cache decorator, one that is actually slower than the previous implementation. See Why are uncompiled, repeatedly used regexes so much slower in Python 3?

A simple text search with `in` is faster for exact text matches. Regular expressions are great for complex matching, but you are simply looking for an exact match, albeit a case-insensitive one.
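If you want to compare the two approaches directly, a similar hypothetical micro-benchmark (same caveat about machine-dependent numbers):

import re
import timeit

line = "some text <SpanLevel Level='7'> more text"
prnt = re.compile(r"(?i)<spanlevel level='7'>")

# plain substring test against a lowercased copy of the line
print timeit.timeit(lambda: "<spanlevel level='7'>" in line.lower(), number=100000)

# compiled case-insensitive regex search
print timeit.timeit(lambda: prnt.search(line), number=100000)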

Martijn Pieters