I have a Python class that uses a multiprocessing pool to process and clean a large dataset. The method that does most of the cleaning is 'dataCleaner', which needs to call a second method 'processObservation'. I am quite new to Python multiprocessing, and I cannot seem to figure out how to ensure that the method 'processObservation' will get called from 'cleanData' when a new process is spawned. How can I do this? My preference would be to keep all of these methods in the class. I suspect this has to do with the 'call' definition, but am not sure how to modify it appropriately.
def processData(self, dataset, num_procs = mp.cpu_count()):
dataSize = len(dataset)
outputDict = dict()
procs = mp.Pool(processes = num_procs, maxtasksperchild = 1)
# Generate data chunks for processing.
chunk = dataSize / num_procs
dataChunk = [(i, i + chunk) for i in range(0, dataSize, chunk)]
count = 1
print 'Number of data chunks %d' %len(dataChunk)
for i in dataChunk:
procs.apply_async(self.dataCleaner, args = (dataset[i[0]:i[1]], count, ))
count += 1
procs.close()
procs.join()
def cleanData(self, data, procNumber):
print 'Spawning new process: %d' %os.getpid()
tempDict = dict()
print len(data)
for obs in data:
key, value = processObservation(obs)
tempDict[key] = value
cPickle.dump(tempDict, open( '../dataMP/cleanedData_' + str(procNumber) + '.p', 'wb'))
def __call__(self, dataset, count):
return self.cleanData(dataset, count)