
I have an iterator producing data, which I want to decompress.

import gzip

h = open('myfile.gz', 'rb')
data = iter(lambda: h.read(1024), b'')
gzip.decompress(data)

And I get:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.6/gzip.py", line 531, in decompress
    with GzipFile(fileobj=io.BytesIO(data)) as f:
TypeError: a bytes-like object is required, not 'callable_iterator'

How can I decompress the data coming from the iterator? The data cannot be loaded into memory all at once.

blueFast
  • Have you tried gzip.open instead? – alxrcs Jun 03 '20 at 16:20
  • If the file is too large this iterator wouldn't help much. You should use "gzip.open" and read the file in blocks of appropriate size. – Michael Butscher Jun 03 '20 at 16:21
  • @alxrcs the input is the iterator. The provided code is just an example. In my application, I have the iterator, as provided. I do not have the file. – blueFast Jun 03 '20 at 16:21
  • @MichaelButscher the file is not there in my application. The input I have for decompressing is the iterator. – blueFast Jun 03 '20 at 16:22
  • Use "next(data)" to get the whole compressed file data at once from the iterator. – Michael Butscher Jun 03 '20 at 16:22
  • @MichaelButscher the underlying data is 1PB big, and can not be read into memory. – blueFast Jun 03 '20 at 16:24
  • If you can't modify the iterator there is no chance as it reads the whole file at once. – Michael Butscher Jun 03 '20 at 16:27
  • @MichaelButscher modified code to show a better fit for my actual data – blueFast Jun 03 '20 at 16:31
  • I can't believe that it isn't possible to decompress a stream of data that is accessed linearly without seeking, because `gzip -d -c -` in Linux does exactly that - and with appropriate chunking, the output could be presented as an iterator. The only question is whether there is a convenient way to do it in pure python. – alani Jun 03 '20 at 16:33
  • @alaniwi - the public API in the `gzip` module is based around file objects with read/write methods. That's why a generic iterator with some arbitrary block size doesn't work. But your observation is a good one. You could fire up a thread that runs `gzip` as a subprocess and pumps the iterated blocks into stdin, allowing a different thread to pull the decompressed data out. – tdelaney Jun 03 '20 at 16:39
  • `gzip` uses `zlib` - you could use `zlib` directly. – tdelaney Jun 03 '20 at 16:43
  • @tdelaney thanks, `zlib` does the job! – blueFast Jun 03 '20 at 17:00

2 Answers


How can I decompress the iterator?

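You can't, at least not directly. gzip.decompress() accepts only a bytes-like object, so it doesn't work on an arbitrary iterator. You will need to convert the iterator into a byte stream that the gzip module can consume, for example a file-like object passed to gzip.GzipFile(fileobj=...). I would start by looking at the io module, in particular BytesIO.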

Code-Apprentice
  • In my question I stated "how to decompress an iterator", not "a file". I have provided an example of the iterator I have, but I have no control on how this iterator is instantiated. The file does not exist in my application. – blueFast Jun 03 '20 at 16:23
  • @blueFast Thanks for the comment. I edited my answer with a suggestion. – Code-Apprentice Jun 03 '20 at 16:25
  • That sounds like a plan. – blueFast Jun 03 '20 at 16:25
  • This other answer might help too https://stackoverflow.com/questions/12593576/adapt-an-iterator-to-behave-like-a-file-like-object-in-python – alxrcs Jun 03 '20 at 16:26
  • @blueFast Another possibility is to do `bytes(data)`. See https://docs.python.org/3/library/stdtypes.html#bytes. I'm not sure this will work, but it would be really quick to try. – Code-Apprentice Jun 03 '20 at 16:33
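
The idea discussed in this answer and its comments, adapting the iterator into a file-like object that the gzip module can read, might look roughly like the sketch below. The names `IterStream` and `gunzip_chunks` are hypothetical and not part of any library; the sketch assumes the iterator yields `bytes` objects. Unlike wrapping all of the data in `io.BytesIO`, this adapter only ever holds one input chunk at a time.

```python
import gzip
import io


class IterStream(io.RawIOBase):
    """Hypothetical adapter: presents an iterator of bytes chunks as a raw, read-only stream."""

    def __init__(self, chunks):
        self._chunks = iter(chunks)
        self._leftover = b""

    def readable(self):
        return True

    def readinto(self, buffer):
        # Pull from the iterator until there is data to hand out or it is exhausted.
        while not self._leftover:
            try:
                self._leftover = next(self._chunks)
            except StopIteration:
                return 0  # returning 0 signals end of stream
        n = min(len(buffer), len(self._leftover))
        buffer[:n] = self._leftover[:n]
        self._leftover = self._leftover[n:]
        return n


def gunzip_chunks(chunks, out_size=64 * 1024):
    """Yield decompressed blocks of at most out_size bytes (out_size is an arbitrary choice)."""
    raw = io.BufferedReader(IterStream(chunks))
    with gzip.GzipFile(fileobj=raw, mode="rb") as gz:
        while True:
            block = gz.read(out_size)
            if not block:
                break
            yield block
```

With the iterator from the question, usage would be along the lines of `for block in gunzip_chunks(data): ...`, so both input and output are processed in bounded-size pieces.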

Thanks to @tdelaney for pointing me in the right direction:

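import zlib

def unzip_iterable(data):
    # MAX_WBITS | 16 tells zlib to expect a gzip header and trailer
    decompressor = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
    for chunk in data:
        out = decompressor.decompress(chunk)
        if out:  # a small input chunk may not produce any output yet
            yield out
    tail = decompressor.flush()  # emit anything still buffered
    if tail:
        yield tail

# Example only: in the real application the iterator comes from elsewhere
with open('myfile.gz', 'rb') as h:
    data = iter(lambda: h.read(1024), b'')
    for chunk in unzip_iterable(data):
        print(len(chunk))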
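
One caveat worth noting: a single `zlib` decompression object stops at the end of the first gzip member, so a stream made of several concatenated gzip members would be cut short. The sketch below is one possible way to continue across members, using the `eof` and `unused_data` attributes of the decompression object. The function name `unzip_members` is hypothetical, and the sketch assumes the stream contains only valid gzip members with no trailing padding.

```python
import gzip
import zlib


def unzip_members(data):
    """Yield decompressed bytes from an iterable of gzip chunks, across concatenated members."""
    decompressor = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
    for chunk in data:
        while chunk:
            out = decompressor.decompress(chunk)
            if out:
                yield out
            if not decompressor.eof:
                break  # chunk fully consumed; wait for more input
            # The current member ended: feed any leftover bytes to a fresh decompressor.
            chunk = decompressor.unused_data
            decompressor = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
    tail = decompressor.flush()
    if tail:
        yield tail
```

It is a drop-in replacement for the generator above: `for chunk in unzip_members(data): ...`.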
blueFast