I am reading a large number of files and parsing them, so I wanted to parallelize this process. I successfully achieved that by doing the following:
import multiprocessing
import os

def parse_xml(files):
    parsed = {}
    for filename in files:
        with open(os.path.expanduser(filename), 'rb') as data_file:
            # parse logic goes here, filling parsed[filename]
            pass
    return parsed  # dict keyed by each file name
def main():
    numthreads = 16
    numfiles = 24
    pool = multiprocessing.Pool(processes=numthreads)
    rootDir = '~/someLocation/'  # directory that has all the files
    paths = [rootDir + f for f in os.listdir(os.path.expanduser(rootDir)) if len(f) > 1]
    try:
        file_list = pool.map(parse_xml, (paths[fname:fname+numfiles] for fname in xrange(0, len(paths), numfiles)))
        process_results(file_list)
        pool.close()
        pool.join()
    except KeyboardInterrupt:
        pool.terminate()
Everything works fine, but now I want to pass a set (used for filtering) into my parse_xml method.
After reading the pool.map documentation I realized it is not possible to pass a second argument directly, so I wrote the following additional wrapper method:
def parse_xml_wrapper(multiples):
    return parse_xml(*multiples)
In my main, instead of calling parse_xml directly I call this wrapper method while passing the set to it:

file_list = pool.map(parse_xml_wrapper(mySet), (paths[fname:fname+numfiles] for fname in xrange(0,len(paths), numfiles)))
I also modified the parse_xml method to accept the set: def parse_xml(files, mySet). But this raises:

TypeError: parse_xml() takes exactly 2 arguments (110277 given)
How can I successfully pass another parameter to the method that gets called by Pool.map?
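For what it's worth, I suspect the error happens because parse_xml_wrapper(mySet) gets called immediately, so every element of the set is unpacked as a separate positional argument to parse_xml before pool.map ever runs. A minimal sketch of what I think the call should look like instead (assuming mySet, paths, numfiles and pool are defined as above) is to pass the wrapper function itself and map it over (chunk, mySet) tuples:

# build the same chunks of paths as before, but pair each chunk with the filter set
chunks = (paths[i:i+numfiles] for i in xrange(0, len(paths), numfiles))
# each (chunk, mySet) tuple is unpacked inside parse_xml_wrapper into parse_xml(files, mySet)
file_list = pool.map(parse_xml_wrapper, [(chunk, mySet) for chunk in chunks])

Is this the right way to do it, or is there a cleaner approach?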