Download, extract and read a gzip file in Python

Question

I'd like to download, extract and iterate over a text file in Python without having to create temporary files.

Basically, I want this pipe, but in Python:

curl ftp://ftp.theseed.org/genomes/SEED/SEED.fasta.gz | gunzip | processing step

Here's my code:

def main():
    import urllib
    import gzip

    # Download SEED database
    print 'Downloading SEED Database'
    handle = urllib.urlopen('ftp://ftp.theseed.org/genomes/SEED/SEED.fasta.gz')


    with open('SEED.fasta.gz', 'wb') as out:
        while True:
            data = handle.read(1024)
            if len(data) == 0: break
            out.write(data)

    # Extract SEED database
    handle = gzip.open('SEED.fasta.gz')
    with open('SEED.fasta', 'w') as out:
        for line in handle:
            out.write(line)

    # Filter SEED database
    pass

I don't want to use subprocess.Popen() or anything like it, because I want this script to be platform-independent.

The problem is that the Gzip library only accepts filenames as arguments and not handles. The reason for "piping" is that the download step only uses up ~5% CPU and it would be faster to run the extraction and processing at the same time.

EDIT: This won't work because

"Because of the way gzip compression works, GzipFile needs to save its position and move forwards and backwards through the compressed file. This doesn't work when the “file” is a stream of bytes coming from a remote server; all you can do with it is retrieve bytes one at a time, not move back and forth through the data stream." - dive into python

Which is why I get the error

AttributeError: addinfourl instance has no attribute 'tell'

So how does curl url | gunzip | whatever work?

Why isn't this in separate Python files? `python download.py | python extract.py | python filter.py`? — S.Lott, Aug 23 '10 at 14:33
Because executing python scripts from system commands from python scripts is messy. Also, I said that I want this to be platform-independent (meaning those people out there using Windows won't have any problems), and executing system commands makes that difficult. Does DOS even support piping? — Austin Richardson, Aug 23 '10 at 15:36

score 9 · Accepted Answer · answered Aug 23 '10 at 14:41

Just use gzip.GzipFile(fileobj=handle) and you'll be on your way. In other words, it's not really true that "the Gzip library only accepts filenames as arguments and not handles"; you just have to use the fileobj= named argument.
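
A minimal sketch of that call in Python 3, assuming the handle is any binary file-like object with a read() method. An in-memory io.BytesIO stands in for the urlopen response so the example runs without a network connection. iter_gzip_lines is a hypothetical helper name, not part of the gzip module, and the FASTA-like payload is made up for illustration.

```python
import gzip
import io


def iter_gzip_lines(fileobj, encoding="utf-8"):
    """Yield decoded lines from a gzip-compressed binary file object."""
    # fileobj= lets GzipFile wrap an already-open handle instead of a filename.
    with gzip.GzipFile(fileobj=fileobj) as uncompressed:
        for raw_line in uncompressed:
            yield raw_line.decode(encoding).rstrip("\n")


# Stand-in for a urlopen() response: gzip-compressed text held in memory.
fake_response = io.BytesIO(gzip.compress(b">seq1\nACGT\n>seq2\nTTGA\n"))
lines = list(iter_gzip_lines(fake_response))
print(lines)
```

With a real download, the object returned by urllib.request.urlopen(url) would be passed in place of fake_response. Whether that works depends on the Python version, as the comments on this answer discuss.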

Alex Martelli

Thanks! Didn't see that in the docu. – Austin Richardson Aug 23 '10 at 15:21
Python 2: an addinfourl object (as created by `urllib.urlopen`) does not implement `tell` which is required. Thus this answer doesn’t work there. (In Python 3, `http.client.HTTPResponse` does implement `tell`.) – Chris Morgan Apr 28 '16 at 23:27
Minor correction to @ChrisMorgan's note: Python 3's `http.client.HTTPResponse` doesn't implement `tell` either, but `gzip.GzipFile` supports nonseekable files as of Python 3.2. Either way, this answer works with `urlopen` responses in Python 3, which is wonderful. – Trey Hunner Jan 04 '19 at 21:39
@TreyHunner: it looks like `tell` was only added in Python 3.5 (which was about 7 months old when I wrote that comment, and which I had installed). https://docs.python.org/3/library/http.client.html#httpresponse-objects, “Changed in version 3.5: The io.BufferedIOBase interface is now implemented and all of its reader operations are supported.” `tell` is part of IOBase, which BufferedIOBase extends. – Chris Morgan Jan 05 '19 at 10:57
@ChrisMorgan: the `tell()` method doesn't seem to work for me on the objects returned from `urllib.request.urlopen` in Python 3.7. – Trey Hunner Jan 06 '19 at 04:35
Hmm, `io.UnsupportedOperation: seek`. I dunno. – Chris Morgan Jan 08 '19 at 22:43

score 3 · Answer 2 · answered Apr 13 '20 at 20:13

A Python 3 solution that does not require a for loop and writes the bytes object directly to a file opened in binary mode:

import gzip
import urllib.request

def download_file(url):
    out_file = '/path/to/file'

    # Download archive
    try:
        # Read the file inside the .gz archive located at url
        with urllib.request.urlopen(url) as response:
            with gzip.GzipFile(fileobj=response) as uncompressed:
                file_content = uncompressed.read()

        # Write to file in binary mode 'wb'
        with open(out_file, 'wb') as f:
            f.write(file_content)
        return 0

    except Exception as e:
        print(e)
        return 1

Call the function with retval = download_file(url) to capture the return code.
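
The function above holds the whole decompressed file in memory before writing it. For large archives, a chunked copy may be preferable. The following is a sketch under assumptions, not a drop-in replacement: gunzip_to_file and its chunk_size parameter are hypothetical names, the payload is made up, and an io.BytesIO object stands in for the urlopen response so the example runs offline.

```python
import gzip
import io
import os
import shutil
import tempfile


def gunzip_to_file(compressed_stream, out_path, chunk_size=64 * 1024):
    """Decompress a gzip stream to out_path, copying chunk_size bytes at a time."""
    with gzip.GzipFile(fileobj=compressed_stream) as uncompressed:
        with open(out_path, "wb") as out:
            # copyfileobj reads and writes in chunks, so memory use stays bounded.
            shutil.copyfileobj(uncompressed, out, chunk_size)


# In-memory stand-in for urllib.request.urlopen(url).
payload = b">seq1\nACGT\n" * 1000
fake_response = io.BytesIO(gzip.compress(payload))

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "example.fasta")
    gunzip_to_file(fake_response, out_path)
    with open(out_path, "rb") as f:
        written = f.read()

print(len(written))
```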

gibbone · Answer 3 · 2019-07-18T09:34:46.187

I found this question while searching for ways to download and unzip a gzip file from a URL, but I couldn't get the accepted answer to work in Python 2.7.

Here's what worked for me (adapted from here):

import urllib2
import gzip
import StringIO

def download(url):
    # Download SEED database
    out_file_path = url.split("/")[-1][:-3]
    print('Downloading SEED Database from: {}'.format(url))
    response = urllib2.urlopen(url)
    compressed_file = StringIO.StringIO(response.read())
    decompressed_file = gzip.GzipFile(fileobj=compressed_file)

    # Extract SEED database (binary mode, so the bytes are written unchanged)
    with open(out_file_path, 'wb') as outfile:
        outfile.write(decompressed_file.read())

    # Filter SEED database
    # ...
    return

if __name__ == "__main__":
    download("ftp://ftp.ebi.ac.uk/pub/databases/Rfam/12.0/fasta_files/RF00001.fa.gz")

I changed the target URL because the original one was dead; I simply looked for another gzip file served from an FTP server, as in the original question.
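
Buffering the whole response in a StringIO works, but it gives up the streaming behaviour the question asked for. One alternative, swapped in here and not taken from the code above, is incremental decompression with zlib.decompressobj, which only needs read() on the source and never calls tell() or seek(). This is a Python 3 sketch under assumptions: iter_gunzip_chunks and ReadOnlyStream are hypothetical names, the payload is made up, and a single decompressor handles only a single-member gzip file.

```python
import gzip
import io
import zlib


def iter_gunzip_chunks(stream, chunk_size=16 * 1024):
    """Yield decompressed byte chunks from a gzip stream using only read()."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        data = decompressor.decompress(chunk)
        if data:
            yield data
    # Emit anything still buffered inside the decompressor.
    tail = decompressor.flush()
    if tail:
        yield tail


class ReadOnlyStream(object):
    """Test double exposing read() only, with no tell() or seek()."""

    def __init__(self, data):
        self._buffer = io.BytesIO(data)

    def read(self, size=-1):
        return self._buffer.read(size)


payload = b">seq1\nACGT\n" * 500
compressed = gzip.compress(payload)
result = b"".join(iter_gunzip_chunks(ReadOnlyStream(compressed), chunk_size=64))
print(result == payload)
```

Each yielded chunk can be handed to the filtering step as soon as it arrives, which is roughly what the curl | gunzip pipe in the question does.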

score 0 · Answer 4 · answered Aug 05 '20 at 20:27

For Python 3.8, here is my code, written on 08/05/2020:

import re
from urllib import request
import gzip
import shutil

url1 = "https://www.destinationlighting.com/feed/sitemap_items1.xml.gz"
file_name1 = re.split(pattern='/', string=url1)[-1]
r1 = request.urlretrieve(url=url1, filename=file_name1)
txt1 = re.split(pattern=r'\.', string=file_name1)[0] + ".txt"

with gzip.open(file_name1, 'rb') as f_in:
    with open(txt1, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
