I have code that parses a fairly large number of XML files (using the xml.sax library) to extract data for later machine learning. I want the parsing to run in parallel (the server has 24 cores and also hosts some web services, so I decided to use 20 of them). After parsing I want to merge the results. The following code does exactly what I expect, but there is a problem with the parallelism.

from multiprocessing import Pool
from xml.sax import make_parser

def runParse(fname):
    parser = make_parser()
    handler = MyXMLHandler()
    parser.setContentHandler(handler)
    parser.parse(fname)
    return handler.getResult()

def makeData(flist, tasks=20):
    pool = Pool(processes=tasks)
    tmp = pool.map(runParse, flist)
    for result in tmp:
        pass  # and here the merging part

When this part starts, it runs on 20 cores for a while and then drops to only one core, and this happens before the merging part (which of course runs on only one core).

Can anyone help me solve this problem or suggest a way to speed up the program?

Thanks!

ppiikkaaa

1 Answer

Why do you say it goes to only one before completing?

You're using .map(), which collects all the results and only then returns. With a large dataset, you're probably stuck in the collecting phase.

You can try using .imap(), the iterator version of .map(), or even .imap_unordered() if the order of the results doesn't matter (as seems to be the case from your example).

Here's the relevant documentation. This line is worth noting:

For very long iterables using a large value for chunksize can make the job complete much faster than using the default value of 1.
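As a rough sketch of the idea (square() and make_data() here are stand-ins for your runParse() and makeData(), chosen so the example is self-contained), merging each result as .imap_unordered() yields it avoids waiting for .map() to collect the entire result list first:

```python
from multiprocessing import Pool

def square(x):
    # Stand-in for the real per-file work (e.g. parsing one XML file).
    return x * x

def make_data(items, tasks=4, chunksize=50):
    total = 0
    with Pool(processes=tasks) as pool:
        # imap_unordered yields results as soon as workers finish them,
        # in arbitrary order; a large chunksize reduces IPC overhead.
        for result in pool.imap_unordered(square, items, chunksize=chunksize):
            total += result  # merge each result immediately
    return total

if __name__ == '__main__':
    print(make_data(range(100)))  # sum of squares 0..99
```

The key difference from .map() is that the merge loop overlaps with the workers, instead of starting only after every result has been collected.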

Paolo Casciello
  • Thank you! Now I'm using imap with a big chunksize, but it doesn't speed up the processing. The collecting phase takes much more time than processing the collected data (which is not trivial). Do you have any idea why it takes so long? – user2767966 Sep 12 '13 at 12:27
  • Collecting is of course single-process, so it probably depends on the size of the results it needs to collect. You're doing a map-reduce task, so one of the key points is optimizing the reduce part. If you can, also try `.imap_unordered`. Probably a small improvement, but... – Paolo Casciello Sep 12 '13 at 12:37