
I have multiple .gz files that add up to 1TB in total. How can I use Python 2.7 to unzip these files in parallel? Looping over the files one by one takes too much time.

I tried this code as well:

import glob, gzip, multiprocessing, shutil

filenames = [gz for gz in glob.glob(filesFolder + '*.gz')]

def uncompress(path):
    with gzip.open(path, 'rb') as src, open(path.rstrip('.gz'), 'wb') as dest:
        shutil.copyfileobj(src, dest)

with multiprocessing.Pool() as pool:
    for _ in pool.imap_unordered(uncompress, filenames, chunksize=1):
        pass

However, I get the following error:

  with multiprocessing.Pool() as pool:

AttributeError: __exit__

Thanks!

Menkes
  • To use the `with` construct, the object used inside must have `__enter__` and `__exit__` methods. The error says that the `Pool` class doesn't have these, so you can't use it in the `with` statement (a minimal illustration follows these comments). – 0xc0de Mar 02 '16 at 08:11
  • Not quite a duplicate, I think, but maybe [this](http://stackoverflow.com/a/24724452/3714940) answer can help? – SiHa Mar 02 '16 at 08:12
  • Side note: Are you sure that the CPU is the bottleneck? You might run into the IO limit that your backend storage (disks?) can handle. My guess is that running multiple uncompression tasks in parallel would make this even worse (think seek times). – dhke Mar 02 '16 at 08:26
  • Follow up to the IO bottleneck idea - maybe copy the files into a RAMdisk before decompressing? – SiHa Mar 02 '16 at 09:50
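
A minimal sketch of the protocol the first comment describes (the class name ManagedResource is purely illustrative): any object used in a `with` statement needs `__enter__` and `__exit__`, which Python 2.7's multiprocessing.Pool does not define.

class ManagedResource(object):
    def __enter__(self):
        # called when the with-block is entered; the return value is bound by 'as'
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # called when the with-block exits, even if an exception was raised
        return False  # do not suppress exceptions

with ManagedResource() as resource:
    pass  # works, unlike 'with multiprocessing.Pool()' on Python 2.7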

1 Answer


To use the `with` construct, the object used inside must have `__enter__` and `__exit__` methods. The error says that the `Pool` class (or instance) doesn't have these in Python 2.7, so you can't use it in a `with` statement. Try this (the `with` statement is removed; note also that the destination name is built by slicing off the suffix, because `rstrip('.gz')` strips trailing `.`, `g`, and `z` characters rather than the literal suffix):

import glob, gzip, multiprocessing, shutil

filenames = [gz for gz in glob.glob(filesFolder + '*.gz')]

def uncompress(path):
    # slice off the '.gz' suffix; rstrip('.gz') strips any trailing '.', 'g'
    # or 'z' characters and would mangle a name like 'log.gz'
    with gzip.open(path, 'rb') as src, open(path[:-len('.gz')], 'wb') as dest:
        shutil.copyfileobj(src, dest)


for _ in multiprocessing.Pool().imap_unordered(uncompress, filenames, chunksize=1):
    pass
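
If you would rather keep with-style cleanup on Python 2.7, one option (a sketch) is contextlib.closing, which supplies the missing __enter__/__exit__ and only requires the wrapped object to have a close() method:

import contextlib

# contextlib.closing calls pool.close() when the block exits
with contextlib.closing(multiprocessing.Pool()) as pool:
    for _ in pool.imap_unordered(uncompress, filenames, chunksize=1):
        pass
pool.join()  # wait for the worker processes to finish after close()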

EDIT

I agree with @dhke: unless all (or most) of the .gz files are located physically adjacent on disk, the frequent seeks to different locations (which happen more often with multiprocessing) will make this slower than processing the files one by one (serially).
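
One way to check whether parallelism helps or hurts on a given storage backend is to time both variants on a small sample first (a rough sketch; sample_files is assumed to be a short list of .gz paths):

import time

def run_serial(paths):
    for p in paths:
        uncompress(p)

def run_parallel(paths):
    pool = multiprocessing.Pool()
    try:
        pool.map(uncompress, paths)
    finally:
        pool.close()
        pool.join()

for label, runner in (('serial', run_serial), ('parallel', run_parallel)):
    start = time.time()
    runner(sample_files)  # sample_files: a small, representative list of .gz paths
    print '%s took %.1f seconds' % (label, time.time() - start)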

0xc0de