Problem:
Replacing multiple string patterns in a large text file is taking a lot of time. (Python)
Scenario:
I have a large text file with no particular structure, but it contains several patterns, for example email addresses and phone numbers.
There are over 100 different such patterns, and the file is about 10 MB in size (it could grow larger). The file may or may not contain all 100 patterns.
At present, I am replacing the matches using re.sub(), and my approach looks like this:
import gzip
import re

linestr = ''
readfile = gzip.open(path, 'rt')  # read the zipped file in text mode
lines = readfile.readlines()      # load the lines

for line in lines:
    if len(line.strip()) != 0:    # skip the empty lines
        linestr += line

for pattern in patterns:          # patterns holds (regex, replacement) pairs
    regex = pattern[0]
    replace = pattern[1]
    compiled_regex = re.compile(regex)
    linestr = re.sub(compiled_regex, replace, linestr)
This approach is taking a lot of time for large files. Is there a better way to optimize it?
I am thinking of replacing += with .join(), but I am not sure how much that would help. The sketch below shows roughly what I have in mind.
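For reference, this is only a rough sketch of that idea, using the same path, patterns, and imports as above; the replacement logic is unchanged:

with gzip.open(path, 'rt') as readfile:
    # build the string once with join() instead of repeated +=
    linestr = ''.join(line for line in readfile if len(line.strip()) != 0)

for pattern in patterns:
    regex, replace = pattern[0], pattern[1]
    linestr = re.compile(regex).sub(replace, linestr)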