I have several compressed files with sizes on the order of 2GB compressed. The beginning of each file has a set of headers which I parse and extract a list of ~4,000,000 pointers (pointers
).
For each pair of pointers (pointers[i], pointers[i+1])
for 0 <= i < len(pointers)
, I
- seek to
pointers[i]
- read
pointers[i+1]-pointer[i]
- decompress it
- do a single pass operation on that data and update a dictionary with what I find.
The issue is, I can only process roughly 30 of pointer pairs a second using a single Python process, which means each file takes more than a day to get through.
Assuming splitting up the pointers list among multiple processes doesn't hurt performance (due to each process looking at the same file, though different non-overlapping parts), how can I use multiprocessing
to speed this up?
My single threaded operation looks like this:
def search_clusters(pointers, filepath, automaton, counter):
def _decompress_lzma(f, pointer, chunk_size=2**14):
# skipping over this
...
return uncompressed_buffer
first_pointer, last_pointer = pointers[0], pointers[-1]
with open(filepath, 'rb') as fh:
fh.seek(first_pointer)
f = StringIO(fh.read(last_pointer - first_pointer))
for pointer1, pointer2 in zip(pointers, pointers[1:]):
size = pointer2 - pointer1
f.seek(pointer1 - first_pointer)
buffer = _decompress_lzma(f, 0)
# skipping details, ultimately the counter dict is
# modified passing the uncompressed buffer through
# an aho corasick automaton
counter = update_counter_with_buffer(buffer, automaton, counter)
return counter
# parse file and return pointers list
bzf = ZimFile(infile)
pointers = bzf.cluster_pointers
counter = load_counter_dict() # returns collections.Counter()
automaton = load_automaton()
search_clusters(pointers, infile, autmaton, counter)
I tried changing this to use multiprocessing.Pool
:
from itertools import repeat, izip
import logging
import multiprocessing
logger = multiprocessing.log_to_stderr()
logger.setLevel(multiprocessing.SUBDEBUG)
def chunked(pointers, chunksize=1024):
for i in range(0, len(pointers), chunksize):
yield list(pointers[i:i+chunksize+1])
def search_wrapper(args):
return search_clusters(*args)
# parse file and return pointers list
bzf = ZimFile(infile)
pointers = bzf.cluster_pointers
counter = load_counter_dict() # returns collections.Counter()
map_args = izip(chunked(cluster_pointers), repeat(infile),
repeat(automaton.copy()), repeat(word_counter.copy()))
pool = multiprocessing.Pool(20)
results = pool.map(search_wrapper, map_args)
pool.close()
pool.join()
but after a little while of processing, I get the following message and the script just hangs there with no further output:
[DEBUG/MainProcess] cleaning up worker 0
[DEBUG/MainProcess] added worker
[INFO/PoolWorker-20] child process calling self.run()
However, if I run with a serialized version of map
without multiprocessing, things run just fine:
map(search_wrapper, map_args)
Any advice on how to change my multiprocessing code so it doesn't hang? Is it even a good idea to attempt to use multiple processes to read the same file?