
I am trying to find and correct typos in very large text files. Basically, I run this code:

import re

ocr = open("text.txt")
text = ocr.readlines()
ocr.close()

clean_text = []
for line in text:
    last = re.sub("^(\\|)([0-9])(\\s)([A-Z][a-z]+[a-z])\\,", "1\\2\t\\3\\4,", line)
    clean_text.append(last)

new_text = open("new_text.txt", "w", newline="\n")
for line in clean_text:
    new_text.write(line)
new_text.close()

In reality I call the re.sub function more than 1,500 times, and text.txt has 100,000 lines. Can I split my text into pieces and use different cores for different parts?

    I don't know how python processes re, but in general it is optimal to call re.compile() once, and re.execute() repeatedly. – wildplasser Dec 04 '19 at 23:47
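As the comment suggests, compiling the pattern once and reusing it avoids re-parsing the regex on every call; in Python the compiled pattern's .sub() method plays the role of re.execute(). A minimal sketch applying this to the loop from the question:

import re

# compile the pattern once, outside the per-line loop
pattern = re.compile("^(\\|)([0-9])([0-9]?)(\\s)([A-Z][a-z]+[a-z])\\,") if False else re.compile("^(\\|)([0-9])(\\s)([A-Z][a-z]+[a-z])\\,")

clean_text = []
for line in text:
    # reuse the compiled pattern for every line
    clean_text.append(pattern.sub("1\\2\t\\3\\4,", line))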

1 Answer


This applies a text-processing function (currently the re.sub from your question) to NUM_CORES equally sized chunks of your input text file, then writes the results out, preserving the order of the original file.

from multiprocessing import Pool, cpu_count
import numpy as np
import re

NUM_CORES = cpu_count()

def process_text(input_textlines):
    clean_text = []
    for line in input_textlines:
        cleaned = re.sub("^(\\|)([0-9])(\\s)([A-Z][a-z]+[a-z])\\,", "1\\2\t\\3\\4,", line)
        clean_text.append(cleaned)
    return "".join(clean_text)

# read in data and convert to sequence of equally-sized chunks
with open('data/text.txt', 'r') as f:
    lines = f.readlines()

text_chunks = np.array_split(lines, NUM_CORES)

# process each chunk in parallel, then release the worker processes
with Pool(NUM_CORES) as pool:
    results = pool.map(process_text, text_chunks)

# write out results
with open("new_text.txt", "w", newline="\n") as f:
    for text_chunk in results:
        f.write(text_chunk)
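Since your real workload applies more than 1,500 substitutions, it is also worth precompiling them once at module level and looping over them inside process_text, so each worker pays the compile cost only once. A sketch under that assumption (SUBSTITUTIONS is a hypothetical stand-in for your actual rules):

# hypothetical list standing in for the ~1,500 real (pattern, replacement) rules
SUBSTITUTIONS = [
    (re.compile("^(\\|)([0-9])(\\s)([A-Z][a-z]+[a-z])\\,"), "1\\2\t\\3\\4,"),
    # ... more (compiled pattern, replacement) pairs ...
]

def process_text(input_textlines):
    clean_text = []
    for line in input_textlines:
        # apply every substitution rule to the line, in order
        for pattern, replacement in SUBSTITUTIONS:
            line = pattern.sub(replacement, line)
        clean_text.append(line)
    return "".join(clean_text)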
Max Power