Download and decompress gzipped file in memory?

Question

I would like to download a file using urllib and decompress the file in memory before saving.

This is what I have right now:

response = urllib2.urlopen(baseURL + filename)
compressedFile = StringIO.StringIO()
compressedFile.write(response.read())
decompressedFile = gzip.GzipFile(fileobj=compressedFile, mode='rb')
outfile = open(outFilePath, 'w')
outfile.write(decompressedFile.read())

This ends up writing empty files. How can I achieve what I'm after?

Updated Answer:

#! /usr/bin/env python2
import urllib2
import StringIO
import gzip

baseURL = "https://www.kernel.org/pub/linux/docs/man-pages/"
# check filename: it may change over time, due to new updates
filename = "man-pages-5.00.tar.gz"
outFilePath = filename[:-3]

response = urllib2.urlopen(baseURL + filename)
compressedFile = StringIO.StringIO(response.read())
decompressedFile = gzip.GzipFile(fileobj=compressedFile)

# Open in binary mode: the decompressed tarball is binary data.
with open(outFilePath, 'wb') as outfile:
    outfile.write(decompressedFile.read())

I am decompressing to disk, just never letting the compressed bytes touch the disk. — OregonTrail, Mar 12 '13 at 03:30
unrelated: you could use `shutil.copyfileobj(decompressed_file, outfile)` to save the file chunk by chunk without loading it in memory. — jfs, Jun 11 '15 at 11:58

score 52 · Accepted Answer · answered Mar 12 '13 at 04:25

You need to seek back to the beginning of compressedFile after writing to it and before passing it to gzip.GzipFile(). Otherwise the gzip module starts reading at the current position, which is the end of the buffer, so the data looks like an empty file. See below:

#! /usr/bin/env python
import urllib2
import StringIO
import gzip

baseURL = "https://www.kernel.org/pub/linux/docs/man-pages/"
filename = "man-pages-3.34.tar.gz"
outFilePath = "man-pages-3.34.tar"

response = urllib2.urlopen(baseURL + filename)
compressedFile = StringIO.StringIO()
compressedFile.write(response.read())
#
# Set the file's current position to the beginning
# of the file so that gzip.GzipFile can read
# its contents from the top.
#
compressedFile.seek(0)

decompressedFile = gzip.GzipFile(fileobj=compressedFile, mode='rb')

# Open in binary mode: the decompressed tarball is binary data.
with open(outFilePath, 'wb') as outfile:
    outfile.write(decompressedFile.read())
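
The effect of the missing seek can be reproduced without any network access. The following is a minimal sketch in Python 3 (so `io.BytesIO` stands in for `StringIO.StringIO`), with a small payload made up for the demonstration in place of `response.read()`:

```python
import gzip
import io

# Build a small gzip payload in memory (a stand-in for response.read()).
payload = gzip.compress(b"hello, gzip")

# Writing into an empty buffer leaves the position at the end.
buf = io.BytesIO()
buf.write(payload)

# Reading from the end: gzip finds no data and returns an empty result.
without_seek = gzip.GzipFile(fileobj=buf, mode="rb").read()

# Rewind, then read again: the full content comes back.
buf.seek(0)
with_seek = gzip.GzipFile(fileobj=buf, mode="rb").read()

print(repr(without_seek))
print(repr(with_seek))
```

Passing the bytes to the buffer's constructor instead of calling `write()` avoids the problem, because a freshly constructed buffer starts at position 0.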

answered Mar 12 '13 at 04:25

crayzeewulf

Turns out I could have taken advantage of StringIO's `__init__`, see updated question. – OregonTrail Mar 12 '13 at 05:26
Yeah. That works even better. :) I will leave my answer unedited as you've already added the updated answer. Thanks. – crayzeewulf Mar 12 '13 at 05:28
@OregonTrail: or you could cut out the middleman and [pass `response` directly](http://stackoverflow.com/a/26435241/4279). btw, don't put *answers* into the question; [you are encouraged to post your own answer](http://stackoverflow.com/help/self-answer). – jfs Jun 11 '15 at 11:54

lyschoening · Answer 2 · 2015-06-11T12:41:12.807

25

For those using Python 3, the equivalent answer is:

import urllib.request
import io
import gzip

response = urllib.request.urlopen(FILE_URL)
compressed_file = io.BytesIO(response.read())
decompressed_file = gzip.GzipFile(fileobj=compressed_file)

with open(OUTFILE_PATH, 'wb') as outfile:
    outfile.write(decompressed_file.read())

edited Jun 11 '15 at 12:41

answered Feb 09 '15 at 15:03

lyschoening

it won't work: you are trying to write bytes into a text file; use binary mode instead. Try: `copyfileobj(GzipFile(fileobj=response), open(outfile_path, 'wb'))` – jfs Jun 11 '15 at 12:01
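
The streaming variant suggested in the comments (hand the response straight to `GzipFile` and copy with `shutil.copyfileobj`) might look like the sketch below in Python 3. The function name `gunzip_stream_to_file` and the `chunk_size` default are illustrative choices, not something from this thread.

```python
import gzip
import io
import shutil


def gunzip_stream_to_file(source, out_path, chunk_size=64 * 1024):
    """Decompress a gzip stream from `source` into `out_path` chunk by chunk.

    `source` is any binary file-like object with a read() method, such as
    an in-memory io.BytesIO or (assumed here) an HTTP response object.
    """
    with gzip.GzipFile(fileobj=source, mode="rb") as decompressed:
        # Binary mode: the decompressed data is written as raw bytes.
        with open(out_path, "wb") as outfile:
            shutil.copyfileobj(decompressed, outfile, chunk_size)
```

Assuming the object returned by `urllib.request.urlopen(url)` is passed as `source`, neither the compressed nor the decompressed data has to be held in memory in full; only one chunk at a time is buffered.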

score 18 · Answer 3 · answered Dec 05 '15 at 18:41

If you have Python 3.2 or above, this becomes much simpler, because gzip.decompress() accepts the downloaded bytes directly:

#!/usr/bin/env python3
import gzip
import urllib.request

baseURL = "https://www.kernel.org/pub/linux/docs/man-pages/"
filename = "man-pages-4.03.tar.gz"
outFilePath = filename[:-3]

response = urllib.request.urlopen(baseURL + filename)
with open(outFilePath, 'wb') as outfile:
    outfile.write(gzip.decompress(response.read()))

For those who are interested in history, see https://bugs.python.org/issue3488 and https://hg.python.org/cpython/rev/3fa0a9553402.
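
As a quick offline check, the sketch below compares `gzip.decompress()` with the file-object form used in the earlier answers. The sample data is invented for the demonstration and stands in for `response.read()`.

```python
import gzip
import io

original = b"man page contents\n" * 3
compressed = gzip.compress(original)  # stand-in for response.read()

# One-call form (Python 3.2+).
via_decompress = gzip.decompress(compressed)

# File-object form: wrap the bytes in a buffer and read through GzipFile.
via_gzipfile = gzip.GzipFile(fileobj=io.BytesIO(compressed)).read()

print(via_decompress == via_gzipfile == original)
```

Both forms hold the whole decompressed result in memory, so for very large downloads a chunked copy may be the better fit.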

score -3 · Answer 4 · edited Sep 29 '17 at 07:37

A one-liner (Python 2) that prints the decompressed file content:

print gzip.GzipFile(fileobj=StringIO.StringIO(urllib2.urlopen(DOWNLOAD_LINK).read()), mode='rb').read()

edited Sep 29 '17 at 07:37

Cédric Julien

answered Mar 02 '17 at 11:54

BaiJiFeiLong
