I'm trying to use multiprocessing in Python, but I don't think I understand it properly.
To start with, I have a DataFrame that contains texts as strings, on which I want to run some regex cleanup.
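I can't share the real data, but a made-up minimal sample of what data looks like would be roughly this (the column name "qa" is the one used in the code below; the real texts are much longer):

import pandas as pd

# Hypothetical sample rows; the real DataFrame has many more rows and longer texts
data = pd.DataFrame({
    "qa": [
        "----- some header ----- What is Python? [link] It is a language!",
        "Another text, with [brackets] and punctuation...",
    ]
})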
The code looks as follows:

import os
import re
import multiprocess
from threading import Thread

def clean_qa():
    # Remove long dashed separators, bracketed snippets and leftover punctuation
    for index, row in data.iterrows():
        data["qa"].loc[index] = re.sub(r"(\-{5,}).{1,100}(\-{5,})|(\[.{1,50}\])|[^\w\s]", "", str(data["qa"].loc[index]))

# First attempt: threading
if __name__ == '__main__':
    threads = []
    for i in range(os.cpu_count()):
        threads.append(Thread(target=clean_qa))
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

# Second attempt: multiprocessing (via the multiprocess package)
if __name__ == '__main__':
    processes = []
    for i in range(os.cpu_count()):
        processes.append(multiprocess.Process(target=clean_qa))
    for process in processes:
        process.start()
    for process in processes:
        process.join()
When I don't run clean_qa as a function but simply execute the for loop directly, everything works fine and takes about 3 minutes.
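By "executing the for loop" I mean just running the loop body sequentially, without any threads or processes, like this:

# Plain sequential run of the same loop (this is the version that works)
for index, row in data.iterrows():
    data["qa"].loc[index] = re.sub(r"(\-{5,}).{1,100}(\-{5,})|(\[.{1,50}\])|[^\w\s]", "", str(data["qa"].loc[index]))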
However, when I use multiprocessing or threading, the execution takes about 10 minutes, and the text is not cleaned, so the DataFrame is unchanged.
So my question: what did I do wrong? Why does it take longer, and why does nothing happen to the DataFrame?
Thank you very much!