
I would like to download a file using urllib and decompress the file in memory before saving.

This is what I have right now:

response = urllib2.urlopen(baseURL + filename)
compressedFile = StringIO.StringIO()
compressedFile.write(response.read())
decompressedFile = gzip.GzipFile(fileobj=compressedFile, mode='rb')
outfile = open(outFilePath, 'w')
outfile.write(decompressedFile.read())

This ends up writing empty files. How can I achieve what I'm after?

Updated Answer:

#! /usr/bin/env python2
import urllib2
import StringIO
import gzip

baseURL = "https://www.kernel.org/pub/linux/docs/man-pages/"
# Check the filename: it changes as new versions are released
filename = "man-pages-5.00.tar.gz"
outFilePath = filename[:-3]

response = urllib2.urlopen(baseURL + filename)
compressedFile = StringIO.StringIO(response.read())
decompressedFile = gzip.GzipFile(fileobj=compressedFile)

# Open in binary mode: the decompressed tar archive is binary data
with open(outFilePath, 'wb') as outfile:
    outfile.write(decompressedFile.read())
gibbone
OregonTrail

4 Answers


You need to seek back to the beginning of compressedFile after writing to it and before passing it to gzip.GzipFile(). Otherwise the gzip module starts reading at the end of the buffer, where the write left the file position, and sees what looks like an empty file. See below:

#! /usr/bin/env python
import urllib2
import StringIO
import gzip

baseURL = "https://www.kernel.org/pub/linux/docs/man-pages/"
filename = "man-pages-3.34.tar.gz"
outFilePath = "man-pages-3.34.tar"

response = urllib2.urlopen(baseURL + filename)
compressedFile = StringIO.StringIO()
compressedFile.write(response.read())
#
# Set the file's current position to the beginning
# of the file so that gzip.GzipFile can read
# its contents from the top.
#
compressedFile.seek(0)

decompressedFile = gzip.GzipFile(fileobj=compressedFile, mode='rb')

# Open in binary mode: the decompressed tar archive is binary data
with open(outFilePath, 'wb') as outfile:
    outfile.write(decompressedFile.read())
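
To see the position issue in isolation, here is a minimal sketch in Python 3 that needs no network access. It uses io.BytesIO in place of StringIO.StringIO, and the payload is made up purely for illustration:

```python
import gzip
import io

# Hypothetical sample data standing in for the downloaded file
payload = b"hello, gzip"
compressed = gzip.compress(payload)

# Writing leaves the buffer's position at the end of the data
buf = io.BytesIO()
buf.write(compressed)
position_after_write = buf.tell()

# Reading from that position: GzipFile finds no data at all
empty_result = gzip.GzipFile(fileobj=buf, mode='rb').read()

# After rewinding, the same buffer decompresses normally
buf.seek(0)
rewound_result = gzip.GzipFile(fileobj=buf, mode='rb').read()
```

Passing the bytes to the buffer's constructor, as in io.BytesIO(compressed), avoids the problem because a freshly constructed buffer starts at position 0.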
crayzeewulf
  • Turns out I could have taken advantage of StringIO's `__init__`; see the updated question. – OregonTrail Mar 12 '13 at 05:26
  • Yeah. That works even better. :) I will leave my answer unedited as you've already added the updated answer. Thanks. – crayzeewulf Mar 12 '13 at 05:28
  • @OregonTrail: or you could cut out the middleman and [pass `response` directly](http://stackoverflow.com/a/26435241/4279). btw, don't put *answers* into the question; [you are encouraged to post your own answer](http://stackoverflow.com/help/self-answer). – jfs Jun 11 '15 at 11:54

For those using Python 3, the equivalent answer is:

import urllib.request
import io
import gzip

response = urllib.request.urlopen(FILE_URL)
compressed_file = io.BytesIO(response.read())
decompressed_file = gzip.GzipFile(fileobj=compressed_file)

with open(OUTFILE_PATH, 'wb') as outfile:
    outfile.write(decompressed_file.read())
lyschoening
  • It won't work: you are trying to write bytes into a text file; use binary mode instead. Try: `copyfileobj(GzipFile(fileobj=response), open(outfile_path, 'wb'))` – jfs Jun 11 '15 at 12:01
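
The suggestion in the comment above can be sketched as follows in Python 3. The helper name gunzip_stream and the sample data are hypothetical, and an in-memory buffer stands in for the HTTP response so the snippet runs without a network connection:

```python
import gzip
import io
import shutil


def gunzip_stream(src, dst, chunk_size=64 * 1024):
    """Decompress gzip data from file-like src into file-like dst, chunk by chunk."""
    # GzipFile does not close src when it was passed in via fileobj
    with gzip.GzipFile(fileobj=src, mode='rb') as decompressed:
        shutil.copyfileobj(decompressed, dst, chunk_size)


# Hypothetical data; an in-memory buffer plays the role of the response
original = b"streamed content" * 1000
source = io.BytesIO(gzip.compress(original))
target = io.BytesIO()
gunzip_stream(source, target)
```

With a real download, the same helper should accept the object returned by urllib.request.urlopen() as src and a file opened in 'wb' mode as dst. The data is then decompressed in chunks instead of being read into memory all at once.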

If you have Python 3.2 or above, this is much simpler, because gzip.decompress() is available:

#!/usr/bin/env python3
import gzip
import urllib.request

baseURL = "https://www.kernel.org/pub/linux/docs/man-pages/"
filename = "man-pages-4.03.tar.gz"
outFilePath = filename[:-3]

response = urllib.request.urlopen(baseURL + filename)
with open(outFilePath, 'wb') as outfile:
    outfile.write(gzip.decompress(response.read()))

For those who are interested in history, see https://bugs.python.org/issue3488 and https://hg.python.org/cpython/rev/3fa0a9553402.

Chih-Hsuan Yen

A one-liner (Python 2) that prints the decompressed file content:

print gzip.GzipFile(fileobj=StringIO.StringIO(urllib2.urlopen(DOWNLOAD_LINK).read()), mode='rb').read()
Cédric Julien
BaiJiFeiLong