
I have multiple .gz files that add up to 1TB in total. How can I use Python 2.7 to unzip these files in parallel? Looping over the files one by one takes too much time.

I tried this code as well:

import glob, gzip, multiprocessing, shutil

filenames = [gz for gz in glob.glob(filesFolder + '*.gz')]

def uncompress(path):
    with gzip.open(path, 'rb') as src, open(path.rstrip('.gz'), 'wb') as dest:
        shutil.copyfileobj(src, dest)

with multiprocessing.Pool() as pool:
    for _ in pool.imap_unordered(uncompress, filenames, chunksize=1):
        pass

However, I get the following error:

  with multiprocessing.Pool() as pool:

AttributeError: __exit__

Thanks!

Menkes
  • To use the `with` construct, the object used inside must have `__enter__` and `__exit__` methods. The error says that the `Pool` class doesn't have these, so you can't use it in the `with` statement (a minimal illustration follows these comments). – 0xc0de Mar 02 '16 at 08:11
  • Not quite a duplicate, I think, but maybe [this](http://stackoverflow.com/a/24724452/3714940) answer can help? – SiHa Mar 02 '16 at 08:12
  • Side note: Are you sure that the CPU is the bottleneck? You might run into the IO limit that your backend storage (disks?) can handle. My guess is that running multiple uncompression tasks in parallel would make this even worse (think seek times). – dhke Mar 02 '16 at 08:26
  • Follow up to the IO bottleneck idea - maybe copy the files into a RAMdisk before decompressing? – SiHa Mar 02 '16 at 09:50
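
A minimal sketch of the protocol the first comment describes (the class name ManagedResource is purely illustrative): any object used in a `with` statement needs `__enter__` and `__exit__`, which Python 2.7's multiprocessing.Pool does not define.

class ManagedResource(object):
    def __enter__(self):
        # called when the with-block is entered; the return value is bound by 'as'
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # called when the with-block exits, even if an exception was raised
        return False  # do not suppress exceptions

with ManagedResource() as resource:
    pass  # works, unlike 'with multiprocessing.Pool()' on Python 2.7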

1 Answer


To use the `with` construct, the object used inside must have `__enter__` and `__exit__` methods. The error says that the `Pool` class (or instance) doesn't have these in Python 2.7, so you can't use it in a `with` statement. Try this (the `with` statement is removed; note also that the destination name is built by slicing off the suffix, because `rstrip('.gz')` strips trailing `.`, `g`, and `z` characters rather than the literal suffix):

import glob, gzip, multiprocessing, shutil

filenames = [gz for gz in glob.glob(filesFolder + '*.gz')]

def uncompress(path):
    # slice off the '.gz' suffix; rstrip('.gz') strips any trailing '.', 'g'
    # or 'z' characters and would mangle a name like 'log.gz'
    with gzip.open(path, 'rb') as src, open(path[:-len('.gz')], 'wb') as dest:
        shutil.copyfileobj(src, dest)


for _ in multiprocessing.Pool().imap_unordered(uncompress, filenames, chunksize=1):
    pass
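
If you would rather keep with-style cleanup on Python 2.7, one option (a sketch) is contextlib.closing, which supplies the missing __enter__/__exit__ and only requires the wrapped object to have a close() method:

import contextlib

# contextlib.closing calls pool.close() when the block exits
with contextlib.closing(multiprocessing.Pool()) as pool:
    for _ in pool.imap_unordered(uncompress, filenames, chunksize=1):
        pass
pool.join()  # wait for the worker processes to finish after close()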

EDIT

I agree with @dhke: unless all (or most) of the .gz files are located physically adjacent on disk, the frequent seeks to different locations (which happen more often with multiprocessing) will make this slower than processing the files one by one (serially).
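
One way to check whether parallelism helps or hurts on a given storage backend is to time both variants on a small sample first (a rough sketch; sample_files is assumed to be a short list of .gz paths):

import time

def run_serial(paths):
    for p in paths:
        uncompress(p)

def run_parallel(paths):
    pool = multiprocessing.Pool()
    try:
        pool.map(uncompress, paths)
    finally:
        pool.close()
        pool.join()

for label, runner in (('serial', run_serial), ('parallel', run_parallel)):
    start = time.time()
    runner(sample_files)  # sample_files: a small, representative list of .gz paths
    print '%s took %.1f seconds' % (label, time.time() - start)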

0xc0de