
I'm trying to use multiprocessing in Python; however, I think I might not have understood it properly.

To start with, I have a dataframe containing texts as strings, on which I want to perform some regex substitutions. The code looks as follows:

import multiprocessing
import os
import re
from threading import Thread

# `data` is a module-level dataframe with a "qa" column of strings
# (its definition is not shown here).

def clean_qa():
    # Strip long dashed separators, bracketed tags, and punctuation from each row
    for index, row in data.iterrows():
        data.loc[index, "qa"] = re.sub(
            r"(\-{5,}).{1,100}(\-{5,})|(\[.{1,50}\])|[^\w\s]",
            "",
            str(data.loc[index, "qa"]),
        )

# First attempt: threads
if __name__ == '__main__':
    threads = []

    for i in range(os.cpu_count()):
        threads.append(Thread(target=clean_qa))

    for thread in threads:
        thread.start()

    for thread in threads:
        thread.join()

# Second attempt: processes
if __name__ == '__main__':
    processes = []

    for i in range(os.cpu_count()):
        processes.append(multiprocessing.Process(target=clean_qa))

    for process in processes:
        process.start()

    for process in processes:
        process.join()
    

When I run the body of "clean_qa" not as a function but simply by executing the for loop directly, everything works fine and takes about 3 minutes.

However, when I use multiprocessing or threading, the execution takes about 10 minutes, and the text is not cleaned, so the dataframe is unchanged.

Hence my question: what did I do wrong, why does it take longer, and why does nothing happen to the dataframe?

Thank you very much!

Dalogh
  • Your example isn't self-contained. Where does `data` come from? Is it a global variable? In either case, changes to variables within a `multiprocessing` subprocess don't automagically propagate to the parent process, which is why "nothing happens to the dataframe". – AKX Jan 14 '22 at 11:56
  • Edit: The target function is `clean_qa`; I copied the wrong code. – Dalogh Jan 14 '22 at 11:57
  • Regarding the slowdown: Python threads don't run Python code concurrently due to the GIL. Even so, you'd now have N threads, let's say 8 if that's how many CPU cores you have, doing the _same_ loop, which makes no sense. – AKX Jan 14 '22 at 11:58
  • There are too many wrong things here. Maybe someone will come up and explain it all. But to start with: you don't _break_ your task into parallel subtasks; you simply replicate the full workload to all threads and all processes. (Unless `test_qa`, which is not shown, is _really_ smartly crafted, but I think it is just a typo for `clean_qa`, right?) – jsbueno Jan 14 '22 at 12:06
  • Attempting to combine _both_ `multiprocessing` and `threading` seems like severe overkill unless you have separately tried both in isolation. – tripleee Jan 14 '22 at 12:07
  • In your snippet there is really no checking of the resulting data; that is a weird way of asking a question, without the problematic code at all. But then, the reason your "data does not change" is this: you can pass data to a subprocess by reading a global variable there (as you do), or as a worker argument. But that data lives in a completely different, independent process, sharing _nothing_ back with the original process. Any change to the data in the separate process is made only there; you have to use other mechanisms to communicate the data back to the calling process (see the sketch below). – jsbueno Jan 14 '22 at 12:10
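
To illustrate jsbueno's last point, here is a minimal sketch of one such mechanism: a `multiprocessing.Pool` whose workers return cleaned chunks to the parent instead of mutating a global. The dataframe `data` and its "qa" column follow the question; `clean_chunk`, `PATTERN`, and the sample rows are invented for the demonstration.

import multiprocessing
import re

import numpy as np
import pandas as pd

PATTERN = re.compile(r"(\-{5,}).{1,100}(\-{5,})|(\[.{1,50}\])|[^\w\s]")

def clean_chunk(chunk):
    # Runs in a worker process; the return value is pickled back to the parent.
    return chunk.str.replace(PATTERN, "", regex=True)

if __name__ == "__main__":
    # Invented sample rows standing in for the question's dataframe.
    data = pd.DataFrame({"qa": ["----- header ----- hello, world!", "[tag] foo?"]})

    # One chunk per CPU; pool.map sends each chunk to a worker and
    # collects the cleaned chunks in order.
    chunks = np.array_split(data["qa"], multiprocessing.cpu_count())
    with multiprocessing.Pool() as pool:
        data["qa"] = pd.concat(pool.map(clean_chunk, chunks))

Unlike the original snippet, each process gets only a slice of the work, and the results are explicitly shipped back and reassembled in the parent.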

1 Answer


This is slightly beside the point (though my comments on the question address the actual issues), but since you're working with a Pandas dataframe, you really never want to loop over it by hand.

Looks like all you actually want here is just:

import re

r = re.compile(r"(\-{5,}).{1,100}(\-{5,})|(\[.{1,50}\])|[^\w\s]")

def clean_qa():
    # Vectorized replacement over the whole column at once
    data["qa"] = data["qa"].str.replace(r, "", regex=True)

to let Pandas handle the looping internally, which is far faster than assigning row by row with `.loc`.
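
A quick usage sketch of this (sample rows invented for illustration): the pattern strips long dashed separators, bracketed tags, and punctuation in a single vectorized pass.

import re

import pandas as pd

r = re.compile(r"(\-{5,}).{1,100}(\-{5,})|(\[.{1,50}\])|[^\w\s]")

data = pd.DataFrame({"qa": ["----- header ----- hello, world!", "[tag] foo?"]})
data["qa"] = data["qa"].str.replace(r, "", regex=True)
print(data["qa"].tolist())  # dashed span, "[tag]" and the punctuation are gone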

AKX