
I have a text file (no punctuation) that is about 100 MB to 1 GB in size. Here are some example lines:

please check in here
i have a full hd movie
see you again bye bye
press ctrl c to copy text to clipboard
i need your help
...

And a list of replacement tokens, like this:

check in -> check_in
full hd -> full_hd
bye bye -> bye_bye
ctrl c -> ctrl_c
...

And the output I want after applying the replacements to the text file:

please check_in here
i have a full_hd movie
see you again bye_bye
press ctrl_c to copy text to clipboard
i need your help
...

My current approach

replace_tokens = {'ctrl c': 'ctrl_c', ...}  # a Python dictionary
for line in open('text_file'):
    for token in replace_tokens:
        line = re.sub(r'\b{}\b'.format(token), replace_tokens[token], line)
    # Save line to file

This solution works, but it is very slow for a large number of replacement tokens and a large text file. Is there a better solution?

– Hieu Nguyen

4 Answers


Use binary files and string replace, as follows:

  • Process files as binary to reduce overhead of file conversion
  • Use string replace rather than Regex

Code

def process_binary(filename):
    """ Replace strings using binary and string replace
        Processing follows original code flow except using
        binary files and string replace """

    # Map using binary strings
    replace_tokens = {b'ctrl c': b'ctrl_c', b'full hd': b'full_hd', b'bye bye': b'bye_bye', b'check in': b'check_in'}

    outfile = append_id(filename, 'processed')

    with open(filename, 'rb') as fi, open(outfile, 'wb') as fo:
        for line in fi:
            for token in replace_tokens:
                line = line.replace(token, replace_tokens[token])
            fo.write(line)

def append_id(filename, id):
    " Convenience handler for generating name of output file "
    return "{0}_{2}.{1}".format(*filename.rsplit('.', 1) + [id])

Performance Comparison

On a 124 MB file (generated by replicating the posted string):

  • Posted solution: 82.8 seconds
  • Avoid inner loop in Regex (DAWG post): 28.2 seconds
  • Current solution: 9.5 seconds

Current Solution:

  • ~8.7X improvement over the posted solution and
  • ~3X over Regex (avoiding inner loop)

General Trend

[Figure: runtime vs. input size curve for the three approaches, generated with perfplot based upon timeit]
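
For reference, a curve like that can be produced with the perfplot package, which wraps timeit and plots runtime against input size. The kernels and size range below are illustrative stand-ins, not the exact script behind the figure:

import re
import perfplot

line = 'press ctrl c to copy text to clipboard\n'
pattern = re.compile(r'\bctrl c\b')

perfplot.show(
    setup=lambda n: (line * n, (line * n).encode()),
    kernels=[
        lambda data: pattern.sub('ctrl_c', data[0]),         # regex sub on str
        lambda data: data[1].replace(b'ctrl c', b'ctrl_c'),  # binary replace
    ],
    labels=['regex sub', 'binary str.replace'],
    n_range=[2 ** k for k in range(10, 21)],
    xlabel='number of lines',
    equality_check=None,  # kernels return str vs. bytes
)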

Test Code

# Generate Data by replicating posted string
s = """please check in here
i have a full hd movie
see you again bye bye
press ctrl c to copy text to clipboard
i need your help
"""
with open('test_data.txt', 'w') as fo:
    for i in range(1000000):  # Repeat string 1M times
        fo.write(s)

# Time Posted Solution
from time import time
import re

def posted(filename):
    replace_tokens = {'ctrl c': 'ctrl_c', 'full hd': 'full_hd', 'bye bye': 'bye_bye', 'check in': 'check_in'}

    outfile = append_id(filename, 'posted')
    with open(filename, 'r') as fi, open(outfile, 'w') as fo:
        for line in fi:
            for token in replace_tokens:
                line = re.sub(r'\b{}\b'.format(token), replace_tokens[token], line)
            fo.write(line)

def append_id(filename, id):
    return "{0}_{2}.{1}".format(*filename.rsplit('.', 1) + [id])

t0 = time()
posted('test_data.txt')
print('Elapsed time: ', time() - t0)
# Elapsed time:  82.84100198745728

# Time Current Solution
from time import time

def process_binary(filename):
    replace_tokens = {b'ctrl c': b'ctrl_c', b'full hd': b'full_hd', b'bye bye': b'bye_bye', b'check in': b'check_in'}

    outfile = append_id(filename, 'processed')
    with open(filename, 'rb') as fi, open(outfile, 'wb') as fo:
        for line in fi:
            for token in replace_tokens:
                line = line.replace(token, replace_tokens[token])
            fo.write(line)

def append_id(filename, id):
    return "{0}_{2}.{1}".format(*filename.rsplit('.', 1) + [id])


t0 = time()
process_binary('test_data.txt')
print('Elapsed time: ', time() - t0)
# Elapsed time:  9.593998670578003

# Time Processing using Regex 
# Avoiding inner loop--see dawg posted answer

import re 

def process_regex(filename):
    tokens = {"check in": "check_in", "full hd": "full_hd",
              "bye bye": "bye_bye", "ctrl c": "ctrl_c"}

    regex=re.compile("|".join([r"\b{}\b".format(t) for t in tokens]))

    outfile = append_id(filename, 'regex')
    with open(filename, 'r') as fi, open(outfile, 'w') as fo:
        for line in fi:
            line = regex.sub(lambda m: tokens[m.group(0)], line)
            fo.write(line)

def append_id(filename, id):
    return "{0}_{2}.{1}".format(*filename.rsplit('.', 1) + [id])

t0 = time()
process_regex('test_data.txt')
print('Elapsed time: ', time() - t0)
# Elapsed time:  28.27900242805481
– DarrylG

You can at least remove the complexity of an inner loop by doing something along these lines:

import re 

tokens = {"check in": "check_in", "full hd": "full_hd",
          "bye bye": "bye_bye", "ctrl c": "ctrl_c"}

regex=re.compile("|".join([r"\b{}\b".format(t) for t in tokens]))

with open(your_file) as f:
    for line in f:
        line=regex.sub(lambda m: tokens[m.group(0)], line.rstrip())
        print(line)

Prints:

please check_in here
i have a full_hd movie
see you again bye_bye
press ctrl_c to copy text to clipboard
i need your help
– dawg
  • It seems the combined regex has been tested [here](https://stackoverflow.com/questions/42742810/speed-up-millions-of-regex-replacements-in-python-3); it does not bring great performance benefits. – Cheche Jun 18 '20 at 03:39
  • With the file only 1GB, it might also be worthwhile trying to read the whole thing at once, rather than line by line. – Jiří Baum Jun 18 '20 at 03:51
  • Try the same with the [re2](https://pypi.org/project/re2/) library? Depending on the regex, it can be much faster than the built-in one... – Jiří Baum Jun 18 '20 at 04:29
  • I tested the built-in re lib with a grouped regex as in `dawg`'s answer, reading the whole text file as `sabik` suggested. Wow, it is much faster than my solution. – Hieu Nguyen Jun 18 '20 at 04:37
  • @nguyenvanhieuvn — Any difference between using `re` and `re2`, please? I'd be really curious if it helps or not (and how much)... – Jiří Baum Jun 18 '20 at 04:52
  • As others have suggested, making a single regex will eliminate the inner loop.

    regex = re.compile("|".join(r"\b{}\b".format(t) for t in tokens))
    
  • The re2 library can be much faster than the built-in re, especially if there are a lot of tokens and/or a lot of text.

    regex = re2.compile("|".join(r"\b{}\b".format(t) for t in tokens))
    
  • Depending on the amount of memory and the possible size of the file, it may be worthwhile trying to read the whole thing at once, rather than line by line. Especially if the lines are short, handling the lines could be consuming a lot of time even though you aren't actually doing any line-oriented processing.

    text = f.read()
    text = regex.sub(lambda m: tokens[m.group(0)], text)
    

    A further refinement would use findall/finditer rather than sub, then work with the start/end offsets to output pieces of the original file, interleaved with the replacements; that would avoid having two copies of the text in memory.

    text = f.read()
    pos = 0
    for m in regex.finditer(text):
        out_f.write(text[pos:m.start(0)])
        out_f.write(tokens[m.group(0)])
        pos = m.end(0)
    out_f.write(text[pos:])
    
  • If your text is line-wrapped, also consider whether you need to replace instances where a phrase is broken across a line break; that is easy in the "read the whole text into memory" approach. If you need this but the file is too large to read into memory, you may need a word-oriented approach instead: one function reads the file and yields individual words, and another runs a word-oriented finite state machine over them. A minimal sketch follows this list.
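
A minimal sketch of that word-oriented approach, assuming (as in the question's examples) that every token is exactly two words. The helper names are hypothetical, and this simplified version normalizes all whitespace to single spaces rather than preserving the original line layout:

tokens = {('check', 'in'): 'check_in', ('full', 'hd'): 'full_hd',
          ('bye', 'bye'): 'bye_bye', ('ctrl', 'c'): 'ctrl_c'}

def stream_words(f):
    """Yield whitespace-separated words one at a time; line breaks are
    ignored, so phrases broken across lines still match."""
    for line in f:
        yield from line.split()

def replace_pairs(words, tokens):
    """One-word lookbehind: emit the replacement whenever the buffered
    previous word and the current word form a token."""
    prev = None
    for word in words:
        if prev is not None and (prev, word) in tokens:
            yield tokens[(prev, word)]
            prev = None              # both words consumed
        else:
            if prev is not None:
                yield prev
            prev = word
    if prev is not None:
        yield prev

with open('text_file') as fi, open('out_file', 'w') as fo:
    for word in replace_pairs(stream_words(fi), tokens):
        fo.write(word)
        fo.write(' ')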

– Jiří Baum

For best performance, you should use an algorithm that is designed to search a text for multiple patterns simultaneously. There are several such algorithms, such as Aho-Corasick, Rabin-Karp and Commentz-Walter.

An implementation of the Aho-Corasick algorithm can be found on PyPI (for example, the pyahocorasick package).
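
A minimal sketch using pyahocorasick (pip install pyahocorasick) on the question's tokens; all patterns are matched in a single pass over the text, and the manual boundary checks stand in for the regex \b anchors:

import ahocorasick  # pip install pyahocorasick

tokens = {'check in': 'check_in', 'full hd': 'full_hd',
          'bye bye': 'bye_bye', 'ctrl c': 'ctrl_c'}

automaton = ahocorasick.Automaton()
for phrase, replacement in tokens.items():
    automaton.add_word(phrase, (phrase, replacement))
automaton.make_automaton()  # build failure links for one-pass search

def replace_all(text):
    """Replace every whole-word occurrence of the tokens in one pass."""
    out, pos = [], 0
    for end, (phrase, replacement) in automaton.iter(text):
        start = end - len(phrase) + 1
        if start < pos:  # overlaps a replacement already emitted
            continue
        # Word-boundary test: reject matches glued to adjacent word characters
        if start > 0 and text[start - 1].isalnum():
            continue
        if end + 1 < len(text) and text[end + 1].isalnum():
            continue
        out.append(text[pos:start])
        out.append(replacement)
        pos = end + 1
    out.append(text[pos:])
    return ''.join(out)

with open('text_file') as fi, open('out_file', 'w') as fo:
    fo.write(replace_all(fi.read()))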

– Matthias Berndt