
I am processing a 500 MB file, and the processing time increased when I used re.search.

Below are the cases I have tested. In all of them I am reading the file line by line and using only one if condition.

Case 1:

prnt = re.compile(r"(?i)<spanlevel level='7'>")
if prnt.search(line):
    print "Matched"
    out_file.write(line)
else:
    out_file.write(line)

This took 16 seconds to process the entire file.

Case 2:

if re.search(r"(?i)<spanlevel level='7'>", line):
    print "Matched"
    out_file.write(line)
else:
    out_file.write(line)

This took 25 seconds to process the file.

Case 3:

if "<spanlevel level='7'>" in line:
    print "Matched"
    out_file.write(line)
else:
    out_file.write(line)

This took only 8 seconds to process the file.

Can anyone please explain the difference between the three cases? Case 3 is the fastest, but I am unable to do a case-insensitive match with it. How can I do a case-insensitive match in Case 3?

Fla-Hyd
    Generally, when you are not in need of regex, don't use it. Regex is usually more expensive than character scanning or indexOf – nhahtdh Mar 18 '13 at 16:27
  • You are using Python 2 syntax but tagged this with `python-3.x`. What is it, python 2 or 3? – Martijn Pieters Mar 18 '13 at 16:34
  • @MartijnPieters I am using python 2.7 – Fla-Hyd Mar 18 '13 at 17:14
  • The _fastest_ way to do this would probably be to clone the [`fastsearch`](http://hg.python.org/cpython/file/2.7/Objects/stringlib/fastsearch.h) C implementation behind `str.find`, modify it to be case-insensitive, build an `ifind` extension module around it, then run that against an `mmap` of your file. But that's a lot of work. Unless someone's already done it and posted it on PyPI, or this is really important to your work, either your "case 1" or Martijn Pieters' answer is probably good enough, right? – abarnert Mar 18 '13 at 17:31

1 Answer


A case-insensitive search for case 3 first:

if "<spanlevel level='7'>" in line.lower():

By lowercasing `line`, you make the search case-insensitive; the text you are looking for is already all lowercase, so only the line itself needs lowercasing.
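For context, a minimal sketch of the full loop with that change applied (assuming `in_file` and `out_file` are already-open file objects, as in your snippets; Python 2 syntax since you are on 2.7):

needle = "<spanlevel level='7'>"
for line in in_file:
    if needle in line.lower():   # lowercase the line; the needle is already lowercase
        print "Matched"
    out_file.write(line)         # the line is written either way, as in your cases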

As for why case 2 is so much slower: using a pre-compiled regular expression is faster because you avoid the cache lookup for the regular expression pattern on each and every line you read from the file. Under the hood, `re.search()` calls `re.compile()` whenever no cached copy exists yet, and that extra function call and cache check cost you time on every call.
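A rough way to see that overhead for yourself (a hypothetical micro-benchmark; the exact numbers are machine-dependent, but the pre-compiled pattern should come out ahead):

import re
import timeit

line = "some text <SpanLevel Level='7'> more text"

# pattern object built once; search() reuses it directly
prnt = re.compile(r"(?i)<spanlevel level='7'>")
print timeit.timeit(lambda: prnt.search(line), number=100000)

# module-level re.search() goes through the pattern cache on every call
print timeit.timeit(lambda: re.search(r"(?i)<spanlevel level='7'>", line), number=100000)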

That is doubly painful on Python 3.3, which switched to a new caching model using the functools.lru_cache decorator, one that is actually slower than the previous implementation. See Why are uncompiled, repeatedly used regexes so much slower in Python 3?

A simple text search with `in` is faster for exact text matches. Regular expressions are great for complex matching, but you are simply looking for an exact match, albeit a case-insensitive one.
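If you want to compare the two approaches directly, a similar hypothetical micro-benchmark (same caveat about machine-dependent numbers):

import re
import timeit

line = "some text <SpanLevel Level='7'> more text"
prnt = re.compile(r"(?i)<spanlevel level='7'>")

# plain substring test against a lowercased copy of the line
print timeit.timeit(lambda: "<spanlevel level='7'>" in line.lower(), number=100000)

# compiled case-insensitive regex search
print timeit.timeit(lambda: prnt.search(line), number=100000)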

Martijn Pieters