I want to parse and act on a huge amount of data stored in a Pandas dataframe. I've tried using multiprocessing, and although I do get results back, I don't feel that I'm doing it right and would very much appreciate a sanity check.
Essentially, the code takes a Pandas dataframe (df) and splits it into a number of smaller dataframes (10 splits if there are 10 cores). It then creates a multiprocessing.Process for each chunk, supplying one of the smaller dataframes to each Process instance along with the target method (doTheWork). The target method (doTheWork) does its work on the smaller dataframe it was given and then puts the results of the dataframe manipulation onto a queue. Later, once all processes have finished, the data is retrieved from the queue.
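For reference, the splitting step itself is trimmed out of the code below; it just chunks the dataframe into numCores pieces, roughly along these lines (an illustrative sketch using numpy, not my exact code):

import numpy as np

# chunk the dataframe into one piece per core (sketch only)
splitList = np.array_split(df, numCores)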
Here's a trimmed-down version of the code:
import multiprocessing

def doTheWork(self, df, someString, someData, queue):
    newPatientsList = []
    # do all the Pandas work in here on the passed-in dataframe df
    # (lots of Pandas and data manipulation)
    # then add the results to the queue
    queue.put(newPatientsList)
def startHere(self, df, numCores):
    splitList = []
    # split the pandas dataframe into numCores chunks and add each
    # chunk as an element of splitList
    queue = multiprocessing.Queue()
    print("...using " + str(numCores) + " cores.")
    with multiprocessing.Manager() as manager:
        workers = []
        for i in range(numCores):
            # build a new process for each chunk and point it at the doTheWork method
            workers.append(multiprocessing.Process(
                target=self.doTheWork,
                args=(splitList[i], someString, someData, queue)))

        # go through each worker and start its job
        for worker in workers:
            worker.start()

        # go through the queue pulling out data and do stuff with it
        newPatients = Patients()
        for i in range(numCores):
            patientList = queue.get()
            for patient in patientList:
                newPatients.addPatient(patient)

        for worker in workers:
            worker.join()

        return newPatients
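For completeness, the whole thing gets kicked off roughly like so (MyParser and loadMyData are just placeholders for my real class and data-loading code):

if __name__ == "__main__":
    df = loadMyData()    # placeholder for however the dataframe is built
    parser = MyParser()  # placeholder for the object holding startHere/doTheWork
    newPatients = parser.startHere(df, multiprocessing.cpu_count())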