According to this answer, when multiprocessing with multiple arguments, starmap should be used. The problem I am having is that one of my arguments is a constant dataframe. When I build the list of argument tuples for starmap, the dataframe gets stored in it over and over. I thought I could get around this problem by sharing the dataframe through a Manager Namespace, but I can't seem to figure it out.
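For context, the pattern from that answer looks roughly like this (a minimal sketch with made-up names, not my real code); the dataframe ends up in every argument tuple and gets pickled for every task, which is what I'm trying to avoid:

import multiprocessing as mp
import pandas as pd

def count_rows(df, colIndex, splitter):
    # trivial stand-in for the real per-task work
    return df[df.iloc[:, colIndex] == splitter].shape[0]

if __name__ == '__main__':
    big_df = pd.DataFrame({0: [1, 2, 2, 3], 1: ['a', 'b', 'b', 'c']})
    # big_df is repeated in every tuple, so it is sent to the workers once per task
    args = [(big_df, 1, value) for value in big_df.iloc[:, 1].unique()]
    with mp.Pool(processes=3) as pool:
        counts = pool.starmap(count_rows, args)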
My code below hasn't thrown an error, but after 30 minutes no files have been written. The code runs in under 10 minutes without multiprocessing, just calling write_file directly.
import pandas as pd
import numpy as np
import multiprocessing as mp
def write_file(df, colIndex, splitter, outpath):
    # append the rows whose colIndex value equals splitter to their own file
    with open(outpath + splitter + ".txt", 'a') as oFile:
        data = df[df.iloc[:, colIndex] == splitter]
        data.to_csv(oFile, sep='|', index=False, header=False)
mgr = mp.Manager()
ns = mgr.Namespace()
df = pd.read_table(file_, delimiter = '|', header = None)
df.iloc[:, 1] = df.iloc[:, 1].astype(str)  # make the split column a string
ns.df = df  # share the constant dataframe through the namespace
fileList = list(df.iloc[:, 1].astype('str').unique())
for item in fileList:
    with mp.Pool(processes=3) as pool:
        pool.starmap(write_file, np.array((ns, 1, item, outpath)).tolist())
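For comparison, the serial version that finishes in under 10 minutes is roughly this (same df, fileList, and outpath as above):

for item in fileList:
    write_file(df, 1, item, outpath)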