
I have created a custom function to clean up a large text body with regular expressions in Python 3.7. I am using Jupyter Notebook 6.0.3.

import numpy as np
import pandas as pd
import re
import string

def pre_process(arr):
    legal_chars = string.ascii_letters + string.punctuation + string.digits + string.whitespace + "äÄöÖüÜ"
    while "  " in arr:  # collapse repeated spaces
        arr = arr.replace("  ", " ")

    while "\n\n" in arr:  # collapse repeated newlines
        arr = arr.replace("\n\n", "\n")

    for char in arr:  # removes illegal characters
        if char not in legal_chars:
            arr = arr.replace(char, "")

    pattern4 = r"[\d]+\W[\d]+"  # long numbers separated by a non-digit
    pattern4_1 = r"[\d]+\W[\d]+"  # note: identical to pattern4, so the next sub never matches
    arr = re.sub(pattern4, '1', arr)
    arr = re.sub(pattern4_1, '', arr)

    pattern5 = r"\W[\d]+\W[\d]+\W" # remove long numbers enclosed by non-digit
    pattern6 = r"\W[\d]+\W"
    arr = re.sub(pattern5, '.', arr)
    arr = re.sub(pattern6, '', arr)

    pattern1 = r"\d{5,}" # remove long numbers
    arr = re.sub(pattern1, '', arr)
    return arr

When run directly on the respective column of my smaller testing dataframe with .apply, it returns the expected results and the text is cleaned.
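For context, the serial .apply baseline looks roughly like this; `clean` here is a hypothetical, trimmed stand-in for the full pre_process above, just to keep the snippet self-contained:

```python
import re

import pandas as pd

def clean(text):
    # hypothetical stand-in for pre_process: two of its steps
    text = re.sub(r" {2,}", " ", text)   # collapse repeated spaces
    return re.sub(r"\d{5,}", "", text)   # drop long digit runs

df_t = pd.DataFrame({"Text": ["a  b 123456", "hello   world"]})
df_t["Text"] = df_t["Text"].apply(clean)
print(df_t["Text"].tolist())  # ['a b ', 'hello world']
```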

I need to apply this to a much larger dataframe, however, and wanted to try speeding things up with the multiprocessing package.

I used:

import multiprocessing as mp
with mp.Pool() as pool:
    df_t["Text"] = pool.map(pre_process,df_t["Text"])

I have used multiprocessing on the same dataframe with built-in functions successfully, but when run with my custom function, nothing happens: the kernel just freezes. I tried pool.apply() as well, with no results.
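For comparison, the pattern that generally works reliably is to define the worker at the top level of a script (or an importable module) and guard the Pool under `__main__`; functions defined interactively in a notebook can fail to unpickle in spawned worker processes on some platforms, which can present as a hang. A minimal sketch with a hypothetical stand-in cleaner:

```python
# worker.py (hypothetical script; the worker lives at module top level)
import multiprocessing as mp
import re

def clean(text):
    # hypothetical stand-in for pre_process
    return re.sub(r" {2,}", " ", text)

if __name__ == "__main__":
    texts = ["a  b", "c   d"]
    with mp.Pool() as pool:
        # pool.map preserves input order in its results
        print(pool.map(clean, texts))  # ['a b', 'c d']
```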

Could it be a problem in my function or am I implementing multiprocessing in a wrong way?

I tried applying the suggestions here: multiprocessing.Pool: When to use apply, apply_async or map? but no change.

4o4o_Adv

1 Answer


I do not see any problem with your code; in fact, I was able to run it without issue on a dummy pandas DataFrame on my local machine. I do have some thoughts on the potential cause of the problem, though. I had issues with the multiprocessing package before, using Python 3.7 on PyCharm 2019. I resolved the issue by downgrading to PyCharm 2018 and Python 3.6, and with that configuration I ran your code on a dummy DataFrame without any problem.
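The dummy-DataFrame check I ran looked roughly like the sketch below; `pre_process` here is a trimmed, hypothetical stand-in for the question's function, just so the snippet is self-contained:

```python
import multiprocessing as mp
import re

import pandas as pd

def pre_process(arr):
    # trimmed stand-in for the question's function: two of its steps
    while "  " in arr:              # collapse repeated spaces
        arr = arr.replace("  ", " ")
    return re.sub(r"\d{5,}", "", arr)  # drop long digit runs

if __name__ == "__main__":
    df_t = pd.DataFrame({"Text": ["some  text 123456", "more   text"]})
    with mp.Pool() as pool:
        df_t["Text"] = pool.map(pre_process, df_t["Text"])
    print(df_t["Text"].tolist())  # ['some text ', 'more text']
```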

You can check this link concerning the problem (if, of course, my guess is correct).

mustafasencer