1

Problem:

Replacing multiple string patterns in a large text file is taking a lot of time. (Python)

Scenario:

I have a large text file with no particular structure to it, but it contains several patterns, for example email addresses and phone numbers.

The text file has over 100 such patterns, and the file is about 10 MB in size (it could grow). The file may or may not contain all 100 patterns.

At present I am replacing the matches using re.sub(), and my approach to performing the replacements looks as shown below.

import gzip
import re

readfile = gzip.open(path, 'rt')  # read the zipped file as text ('rt' so lines are str)
lines = readfile.readlines()      # load all lines into memory
readfile.close()

linestr = ''
for line in lines:
    if len(line.strip()) != 0:    # skip the empty lines
        linestr += line

for pattern in patterns:          # patterns contains all regexes and their replacements
    regex = pattern[0]
    replace = pattern[1]
    compiled_regex = compile_regex(regex)
    linestr = re.sub(compiled_regex, replace, linestr)

This approach is taking a lot of time for large files. Is there a better way to optimize it?

I am thinking of replacing += with ''.join(), but I am not sure how much that would help.

  • Do you have regex patterns to look for or simple strings? – michaJlS Dec 16 '16 at 21:57
  • If you have such a big file, you could also sort your data with a primary key once and then simply perform a binary search, which will greatly improve performance. It's a one-time trade-off and seems like a quick win to me. Also, at that size, the use of a database should be considered. If you're dealing with a lot of data, applying a structure to it almost always yields a big improvement. Hence the reason that universities often teach data structures as a single course. – FMaz Dec 16 '16 at 21:59
  • @Krazor: The question author says the file has no structure. So I'm wondering how you're thinking of sorting it? – Bill Bell Dec 16 '16 at 22:02
  • Mmh, I possibly misunderstood. I thought that, while there isn't an explicit structure, there are still some recurring patterns (n-tuples of email, cell phone, and name, for example). Recurring patterns could be used to impose a structure. Sorting by the phone number (given that it is cardinal and shouldn't even exceed MAX_INT) seems logical. I could be mistaken. – FMaz Dec 16 '16 at 22:05
  • Related: http://stackoverflow.com/questions/15175142/how-can-i-do-multiple-substitutions-using-regex-in-python (a minimal sketch of that single-pass approach appears after these comments) – Robᵩ Dec 16 '16 at 22:06
  • @michaJlS: Nope, not at the moment. If I may ask, how would that help? – Sanath Kumar Dec 16 '16 at 22:30
  • @Krazor I completely agree with you that structuring the data would help immensely and would allow various other algorithmic techniques to be applied. But there is no apparent structure to the file(s) being analyzed. – Sanath Kumar Dec 16 '16 at 22:32
  • Excuse me then. You should definitely consider the use of a generator, as mentioned by @salah! – FMaz Dec 16 '16 at 22:37
  • Alright, I'll try out Salah's solution. Thanks, @Krazor. – Sanath Kumar Dec 16 '16 at 22:56
  • Are these patterns simple strings, or do they actually make use of regular expression strings? – Mark Ransom Dec 16 '16 at 23:15
  • @MarkRansom They use regex strings – Sanath Kumar Dec 16 '16 at 23:24
  • [Python's Hidden Regular Expression Gems](http://lucumr.pocoo.org/2015/11/18/pythons-hidden-re-gems/) – Adrian Colomitchi Dec 17 '16 at 01:10
  • In comments below you mentioned that some regexes span multiple lines. Would there be a way to still chunk the data so that none of the regexes operate across the chunk boundaries? Can you execute the regexes in any order you want, or is the order fixed? Are there any bigger portions of data which are not affected by any of the regexes, and could those portions be identified before processing? – niemmi Dec 17 '16 at 01:22
  • A pity `re.sub(pattern, repl, string, count=0, flags=0)` doesn't allow a sequence for `repl`, to be indexed by groups matched. – greybeard Dec 17 '16 at 11:04
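
The question linked by Robᵩ above suggests combining all the patterns into a single regex and doing one pass over the text, picking the replacement per match. A minimal sketch of that single-pass idea, assuming the replacements are plain strings (the email/phone patterns below are only illustrative):

import re

# Illustrative pattern/replacement pairs; the real list would come from the
# question's `patterns` structure.
patterns = [(r'[\w.+-]+@[\w-]+\.[\w.]+', '<EMAIL>'),
            (r'\d{3}-\d{3}-\d{4}', '<PHONE>')]

# Wrap every pattern in a named group and join them into one alternation.
combined = re.compile('|'.join('(?P<p%d>%s)' % (i, regex)
                               for i, (regex, _) in enumerate(patterns)))
replacements = {'p%d' % i: repl for i, (_, repl) in enumerate(patterns)}

def pick_replacement(match):
    # `lastgroup` names the alternative that actually matched.
    return replacements[match.lastgroup]

text = 'Mail me at someone@example.com or call 555-123-4567.'
print(combined.sub(pick_replacement, text))
# -> Mail me at <EMAIL> or call <PHONE>.

This scans the text once instead of once per pattern, which matters when there are 100+ patterns; it does assume none of the patterns contain capturing groups of their own.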

2 Answers

2

You could use line_profiler to find which lines in your code take the most time:

pip install line_profiler
kernprof -l run.py                    # decorate the functions to inspect with @profile, then run this
python -m line_profiler run.py.lprof  # display the per-line timings

Another thing: I think the string you are building is too large to hold in memory; maybe you can make use of generators instead.
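
A minimal sketch of that generator idea, assuming every pattern can be applied to a single line on its own (the question notes that some regexes span lines, so this is only a starting point); the path, output file, and example patterns are placeholders:

import gzip
import re

PATH = 'large_file.txt.gz'                             # placeholder input file
PATTERNS = [(r'[\w.+-]+@[\w-]+\.[\w.]+', '<EMAIL>'),   # placeholder patterns
            (r'\d{3}-\d{3}-\d{4}', '<PHONE>')]

def replaced_lines(path, patterns):
    """Yield non-empty lines with all substitutions applied, one line at a time."""
    compiled = [(re.compile(regex), repl) for regex, repl in patterns]
    with gzip.open(path, 'rt') as fp:
        for line in fp:                                # iterate lazily, no readlines()
            if line.strip():
                for regex, repl in compiled:
                    line = regex.sub(repl, line)
                yield line

with open('output.txt', 'w') as out:                   # the whole file never sits in memory
    out.writelines(replaced_lines(PATH, PATTERNS))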

salah
  • The use of a generator makes sense in this context. I don't understand how line_profiler will help, though. – FMaz Dec 16 '16 at 22:27
  • @salah Thank you, I am not particularly familiar with generators; I will look into that. So, in your opinion, splitting the text file into chunks and performing the regex substitutions on each would be faster? – Sanath Kumar Dec 16 '16 at 22:36
  • I'm not entirely sure whether it'll be faster, but it will definitely be more efficient. – FMaz Dec 16 '16 at 23:02
1

You may obtain slightly better results by doing:

import gzip
import re

large_list = []

with gzip.open(path, 'rt') as fp:   # 'rt' so the lines come back as str, not bytes
    for line in fp:                 # iterate directly instead of readlines()
        if line.strip():
            large_list.append(line)

merged_lines = ''.join(large_list)

for regex, replace in patterns:
    compiled_regex = compile_regex(regex)
    merged_lines = re.sub(compiled_regex, replace, merged_lines)

However, further optimization can be achieved once you know what kind of processing you apply. In practice the last line (the re.sub call) is the one that consumes almost all of the CPU time (and memory allocation). If the regexes can be applied on a per-line basis, you can achieve great results with the multiprocessing package. Threading won't gain you anything because of the GIL (https://wiki.python.org/moin/GlobalInterpreterLock).
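
A minimal sketch of that per-line multiprocessing idea, assuming every regex really can be applied to one line in isolation; the path, example patterns, and helper names are placeholders:

import gzip
import re
from multiprocessing import Pool

PATH = 'large_file.txt.gz'                             # placeholder input file
PATTERNS = [(r'[\w.+-]+@[\w-]+\.[\w.]+', '<EMAIL>'),   # placeholder patterns
            (r'\d{3}-\d{3}-\d{4}', '<PHONE>')]

_compiled = None

def _init_worker(patterns):
    # Compile the patterns once per worker process instead of once per line.
    global _compiled
    _compiled = [(re.compile(regex), repl) for regex, repl in patterns]

def _substitute(line):
    for regex, repl in _compiled:
        line = regex.sub(repl, line)
    return line

if __name__ == '__main__':
    with gzip.open(PATH, 'rt') as fp:
        lines = [line for line in fp if line.strip()]
    with Pool(initializer=_init_worker, initargs=(PATTERNS,)) as pool:
        merged_lines = ''.join(pool.map(_substitute, lines, chunksize=1000))

Whether this wins depends on how heavy the regexes are compared with the cost of shipping lines between processes.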

Pierre Alex
  • I echo your thoughts on multiprocessing. And my scenario involves applying regexes both line by line and across lines (with a fixed structure). – Sanath Kumar Dec 16 '16 at 22:48
  • Would it be too expensive to merge all the lines, perform the multi-line regex substitutions on merged_lines, and then split it again to perform the per-line matches? There could be multiple blocks of text that get replaced, which would reduce the length of the file to analyze line by line. – Sanath Kumar Dec 16 '16 at 22:51
  • You may test different variations - the next thing you could do is identify the bottleneck (CPU -> try multiprocessing by splitting your source file or your workflow; I/O -> load everything into memory first). – Pierre Alex Jan 01 '17 at 14:37