I have created a custom function to clean up a large body of text with regular expressions in Python 3.7. I am using Jupyter Notebook 6.0.3.
import numpy as np
import pandas as pd
import re
import string
def pre_process(arr):
    legal_chars = string.ascii_letters + string.punctuation + string.digits + string.whitespace + "äÄöÖüÜ"
    while "  " in arr:  # collapses repeated spaces into one
        arr = arr.replace("  ", " ")
    while "\n\n" in arr:  # collapses repeated newlines into one
        arr = arr.replace("\n\n", "\n")
    for char in arr:  # removes illegal characters
        if char not in legal_chars:
            arr = arr.replace(char, "")
    pattern4 = r"[\d]+\W[\d]+"  # remove long numbers separated by a non-digit
    pattern4_1 = r"[\d]+\W[\d]+"
    arr = re.sub(pattern4, '1', arr)
    arr = re.sub(pattern4_1, '', arr)
    pattern5 = r"\W[\d]+\W[\d]+\W"  # remove long numbers enclosed by non-digits
    pattern6 = r"\W[\d]+\W"
    arr = re.sub(pattern5, '.', arr)
    arr = re.sub(pattern6, '', arr)
    pattern1 = r"\d{5,}"  # remove long numbers
    arr = re.sub(pattern1, '', arr)
    return arr
When run directly on the respective column of my smaller test DataFrame with .apply, it returns the expected results and the text is cleaned.
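That is, something like this (a minimal sketch; df_t is my DataFrame and "Text" the column holding the raw strings):

df_t["Text"] = df_t["Text"].apply(pre_process)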
However, I need to apply this to a much larger DataFrame, so I wanted to try speeding things up with the multiprocessing package.
I used:
import multiprocessing as mp
with mp.Pool() as pool:
    df_t["Text"] = pool.map(pre_process, df_t["Text"])
I have successfully used multiprocessing on the same DataFrame with built-in functions, but when run with my custom function, nothing happens and the kernel just freezes. I tried pool.apply() as well, with no results.
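The pool.apply() attempt looked roughly like this (a sketch from memory; the variable name is illustrative, and pool.apply blocks on each call, so it processes one row at a time):

with mp.Pool() as pool:
    # pool.apply blocks until each individual call returns
    cleaned = [pool.apply(pre_process, args=(text,)) for text in df_t["Text"]]
df_t["Text"] = cleaned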
Could it be a problem with my function, or am I implementing multiprocessing the wrong way?
I tried applying the suggestions from here: multiprocessing.Pool: When to use apply, apply_async or map? but there was no change.
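The apply_async variant, for example, looked roughly like this (a sketch; the .get() calls collect the results in submission order, and they sit inside the with block so the pool is still alive):

with mp.Pool() as pool:
    # submit one task per row, then gather the results in order
    async_results = [pool.apply_async(pre_process, args=(text,)) for text in df_t["Text"]]
    df_t["Text"] = [r.get() for r in async_results]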