
I have a Python class that uses a multiprocessing pool to process and clean a large dataset. The method that does most of the cleaning is `dataCleaner`, which needs to call a second method, `processObservation`. I am quite new to Python multiprocessing, and I cannot figure out how to ensure that `processObservation` gets called from `cleanData` when a new process is spawned. How can I do this? My preference would be to keep all of these methods in the class. I suspect this has to do with the `__call__` definition, but I am not sure how to modify it appropriately.

def processData(self, dataset, num_procs = mp.cpu_count()):
    dataSize = len(dataset)
    outputDict = dict()
    procs = mp.Pool(processes = num_procs, maxtasksperchild = 1)

    # Generate data chunks for processing.
    chunk = dataSize / num_procs
    dataChunk = [(i, i + chunk) for i in range(0, dataSize, chunk)]
    count = 1
    print 'Number of data chunks %d' %len(dataChunk)
    for i in dataChunk:
        procs.apply_async(self.dataCleaner, args = (dataset[i[0]:i[1]], count, ))
        count += 1
    procs.close()
    procs.join()

def cleanData(self, data, procNumber):
    print 'Spawning new process: %d' %os.getpid()
    tempDict = dict()
    print len(data)
    for obs in data:
        key, value = processObservation(obs)
        tempDict[key] = value
    cPickle.dump(tempDict, open( '../dataMP/cleanedData_' + str(procNumber) + '.p', 'wb'))

def __call__(self, dataset, count):
    return self.cleanData(dataset, count)
mle
  • Are `dataCleaner` and `cleanData` supposed to be the same method? And what specifically isn't working with what you're doing now? – dano Nov 25 '14 at 18:43
  • No, they are separate methods. What's happening is that no output is being written to the pickled files. It seems that the issue is occurring within the for loop of cleanData. When I split the functions out and run this using a pool outside the class, it works fine. – mle Nov 25 '14 at 21:09
  • If you are having trouble with a python class, why post python functions and not the relevant methods inside the class (and the enclosing class)? – Mike McKerns Nov 27 '14 at 16:51
  • More importantly, I can't try your code, as there are bits of it missing… like `processObservation`. Please post a working selection of code that is self-contained and demonstrates your issue. – Mike McKerns Nov 27 '14 at 17:01

1 Answer


It's hard to tell what's going on because you haven't given reproducible code or an error message.
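One way to get at the error: `apply_async` returns an `AsyncResult`, and an exception raised in a worker is only re-raised in the parent when you call `get()` on that object. The code in the question discards those return values, so any failure stays silent. Below is a minimal sketch in Python 3 syntax. The names `process_observation`, `clean_chunk`, and `collect` are hypothetical stand-ins, not the code from the question.

```python
import multiprocessing as mp


def process_observation(obs):
    # Hypothetical stand-in for the question's processObservation.
    return obs, obs * 2


def clean_chunk(chunk):
    # Module-level worker: build a dict from one chunk of observations.
    return dict(process_observation(obs) for obs in chunk)


def collect(async_results):
    """Call get() on each AsyncResult and return (results, errors)."""
    results, errors = [], []
    for res in async_results:
        try:
            results.append(res.get())
        except Exception as exc:  # get() re-raises the worker's exception
            errors.append(exc)
    return results, errors


if __name__ == "__main__":
    with mp.Pool(processes=2) as pool:
        pending = [
            pool.apply_async(clean_chunk, ([1, 2, 3],)),
            pool.apply_async(clean_chunk, (None,)),  # fails: None is not iterable
        ]
        results, errors = collect(pending)
    print("results:", results)
    print("errors:", errors)
```

If the workers in the question are failing, the same pattern applied to the objects returned by `procs.apply_async(...)` should show the traceback instead of empty output files.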

However, your issue is very likely because you are using multiprocessing from inside a class.

See: "Using multiprocessing in a class" and "Multiprocessing: How to use Pool.map on a function defined in a class?"
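For context: the pool has to pickle whatever function it sends to a worker, and in Python 2 the standard `pickle` module cannot pickle bound methods such as `self.dataCleaner`. Separately, if `processObservation` is a method, calling it as `processObservation(obs)` without `self.` would raise a `NameError` inside the worker. A common workaround is to keep the pool orchestration in the class and move the per-chunk work to module-level functions. The sketch below assumes Python 3 syntax; `DataProcessor`, `clean_chunk`, `make_chunks`, and the body of `process_observation` are hypothetical, since the original methods are not shown in full.

```python
import multiprocessing as mp


def process_observation(obs):
    # Hypothetical placeholder: the real processObservation is not shown.
    return obs, str(obs).strip().lower()


def clean_chunk(chunk):
    # Lives at module level so the pool can pickle a reference to it.
    return dict(process_observation(obs) for obs in chunk)


def make_chunks(dataset, num_chunks):
    # Ceiling division so the step is never zero, even for small datasets.
    size = max(1, -(-len(dataset) // num_chunks))
    return [dataset[i:i + size] for i in range(0, len(dataset), size)]


class DataProcessor(object):
    # Hypothetical class: only the pool orchestration stays in the class.

    def process_data(self, dataset, num_procs=2):
        cleaned = {}
        with mp.Pool(processes=num_procs) as pool:
            pending = [pool.apply_async(clean_chunk, (chunk,))
                       for chunk in make_chunks(dataset, num_procs)]
            for res in pending:
                cleaned.update(res.get())  # get() re-raises worker errors
        return cleaned


if __name__ == "__main__":
    print(DataProcessor().process_data([" A ", "b", "C "]))
```

Returning each chunk's dict to the parent, rather than pickling it to a file inside the worker, is a design choice made here for brevity; writing one file per chunk as in the question would work the same way from `clean_chunk`.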

Mike McKerns