
I am trying to find and correct typos in very large text files. Basically, I run this code:

import re

ocr = open("text.txt")
text = ocr.readlines()
ocr.close()

clean_text = []
for line in text:
    last = re.sub("^(\\|)([0-9])(\\s)([A-Z][a-z]+[a-z])\\,", "1\\2\t\\3\\4,", line)
    clean_text.append(last)

new_text = open("new_text.txt", "w", newline="\n")
for line in clean_text:
    new_text.write(line)
new_text.close()

In reality I call the re.sub function more than 1,500 times, and text.txt has 100,000 lines. Can I split my text into pieces and use different cores for different parts?

    I don't know how python processes re, but in general it is optimal to call re.compile() once, and re.execute() repeatedly. – wildplasser Dec 04 '19 at 23:47
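As the comment suggests, compiling the pattern once and reusing it avoids re-parsing the regex on every call; in Python the compiled pattern's .sub() method plays the role of re.execute(). A minimal sketch applying this to the loop from the question:

import re

# compile the pattern once, outside the per-line loop
pattern = re.compile("^(\\|)([0-9])([0-9]?)(\\s)([A-Z][a-z]+[a-z])\\,") if False else re.compile("^(\\|)([0-9])(\\s)([A-Z][a-z]+[a-z])\\,")

clean_text = []
for line in text:
    # reuse the compiled pattern for every line
    clean_text.append(pattern.sub("1\\2\t\\3\\4,", line))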

1 Answer


This applies a text-processing function (currently the re.sub from your question) to NUM_CORES equally sized chunks of your input text file, then writes the results out, preserving the order of the original file.

from multiprocessing import Pool, cpu_count
import numpy as np
import re

NUM_CORES = cpu_count()

def process_text(input_textlines):
    clean_text = []
    for line in input_textlines:
        cleaned = re.sub("^(\\|)([0-9])(\\s)([A-Z][a-z]+[a-z])\\,", "1\\2\t\\3\\4,", line)
        clean_text.append(cleaned)
    return "".join(clean_text)

# read in data and convert to sequence of equally-sized chunks
with open('data/text.txt', 'r') as f:
    lines = f.readlines()

text_chunks = np.array_split(lines, NUM_CORES)

# process each chunk in parallel, then release the worker processes
with Pool(NUM_CORES) as pool:
    results = pool.map(process_text, text_chunks)

# write out results
with open("new_text.txt", "w", newline="\n") as f:
    for text_chunk in results:
        f.write(text_chunk)
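Since your real workload applies more than 1,500 substitutions, it is also worth precompiling them once at module level and looping over them inside process_text, so each worker pays the compile cost only once. A sketch under that assumption (SUBSTITUTIONS is a hypothetical stand-in for your actual rules):

# hypothetical list standing in for the ~1,500 real (pattern, replacement) rules
SUBSTITUTIONS = [
    (re.compile("^(\\|)([0-9])(\\s)([A-Z][a-z]+[a-z])\\,"), "1\\2\t\\3\\4,"),
    # ... more (compiled pattern, replacement) pairs ...
]

def process_text(input_textlines):
    clean_text = []
    for line in input_textlines:
        # apply every substitution rule to the line, in order
        for pattern, replacement in SUBSTITUTIONS:
            line = pattern.sub(replacement, line)
        clean_text.append(line)
    return "".join(clean_text)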
Max Power