
I am reading a large number of files and parsing them, so I wanted to parallelize this process. I successfully achieved that by doing the following:

def parse_xml(files):
    results = {}
    for filename in files:
        with open(os.path.expanduser(filename), 'rb') as data_file:
            # parse logic
    return results # key is each file name

def main():
    numthreads = 16
    numfiles = 24
    pool = multiprocessing.Pool(processes=numthreads)
    rootDir = '~/someLocation/' # directory that has all the files
    paths = [rootDir + f for f in os.listdir(os.path.expanduser(rootDir)) if len(f) > 1]
    try:
        file_list = pool.map(parse_xml, (paths[fname:fname+numfiles] for fname in xrange(0,len(paths), numfiles)))
        process_results(file_list)
        pool.close()
        pool.join()         
    except KeyboardInterrupt:
        pool.terminate() 

Everything works fine, but now I want to pass a set (used for filtering) into my parse_xml method.

After reading the pool.map documentation I realized it's not possible directly, so I wrote the following additional wrapper method:

def parse_xml_wrapper(multiples):
    return parse_xml(*multiples)

In my main, instead of calling parse_xml I call this wrapper method, passing the set to it:

file_list = pool.map(parse_xml_wrapper(mySet), (paths[fname:fname+numfiles] for fname in xrange(0,len(paths), numfiles)))

I modified the parse_xml method to accept the set, def parse_xml(files, mySet), but now I get:

TypeError: parse_xml() takes exactly 2 arguments (110277 given)

How can I successfully pass another parameter to the method that gets called in Pool?
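For context, the error can be reproduced without a Pool at all: `parse_xml_wrapper(mySet)` is evaluated eagerly, before `pool.map` ever runs, and `parse_xml(*multiples)` unpacks every element of the set as a separate positional argument. A minimal sketch (the function bodies are hypothetical stand-ins):

```python
# Sketch of the failure mode: the wrapper is called directly with the set,
# and *multiples spreads all of its elements as positional arguments.
def parse_xml(files, my_set):
    return {}

def parse_xml_wrapper(multiples):
    return parse_xml(*multiples)

my_set = set(range(110277))  # same size as in the traceback
caught = None
try:
    parse_xml_wrapper(my_set)  # spreads 110277 positional arguments
except TypeError as exc:
    caught = exc
print(caught)
```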

  • I tried that approach but it didn't work; it could be that I'm not understanding the partial and lock concepts as explained in that question. But if you want to keep it as a duplicate, that's OK. – add-semi-colons Mar 20 '15 at 20:22
  • @dano It also doesn't get parallelized once I use it as in that question's answer. Instead of files being split into chunks and distributed among workers, all files start to process in one. – add-semi-colons Mar 20 '15 at 20:26
  • You should be doing this: `func = partial(parse_xml, mySet); pool.map(func, (paths[fname:fname+numfiles] for fname in xrange(0, len(paths), numfiles)))`. Just make sure `parse_xml` expects `mySet` to be the first parameter. – dano Mar 20 '15 at 20:33
  • I see, that's exactly what I did, except I didn't put `mySet` as the first parameter it accepts. And it looks like it's working. – add-semi-colons Mar 20 '15 at 20:37
  • Should I delete the question? I have upvoted that question and the answer for the great contribution. – add-semi-colons Mar 20 '15 at 20:37
  • Cool, glad it works now. No, you don't need to delete the question; different people search different keywords to find solutions to the same problem, so having different questions that ultimately point to the same solution can be a good thing. – dano Mar 20 '15 at 20:38
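The fix from dano's comment can be sketched as below, assuming `functools.partial` binds the extra argument so `pool.map` still sees a one-argument callable. The names `parse_chunk` and `filter_set` are illustrative stand-ins for `parse_xml` and `mySet`:

```python
# A minimal sketch of passing a fixed extra argument to a Pool worker
# via functools.partial. The bound argument must come first.
import multiprocessing
from functools import partial

def parse_chunk(filter_set, files):
    # filter_set is bound by partial; pool.map supplies `files`.
    return {f: (f in filter_set) for f in files}

if __name__ == '__main__':
    filter_set = {'a.xml', 'c.xml'}
    chunks = [['a.xml', 'b.xml'], ['c.xml', 'd.xml']]
    worker = partial(parse_chunk, filter_set)  # worker(chunk) == parse_chunk(filter_set, chunk)
    pool = multiprocessing.Pool(processes=2)
    try:
        results = pool.map(worker, chunks)
        pool.close()
        pool.join()
    except KeyboardInterrupt:
        pool.terminate()
    print(results)
```

Note that `parse_chunk` must be defined at module level so it can be pickled and sent to the worker processes; a `partial` over a module-level function pickles fine, while a lambda or nested function would not.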
