1

I want to parse a compressed sitemap like www.example.com/sitemap.xml.gz and collect all the URLs in the sitemap without downloading sitemap.xml.gz.

There are ways to parse it after downloading sitemap.xml.gz and decompressing it with the help of lxml or BeautifulSoup etc.

import subprocess

import lxml.html
import requests


def parse_sitemap_gz(url):
    r = requests.get(url, stream=True)
    if 200 != r.status_code:
        return False
    file_name = url.split('/')[-1]

    # download the sitemap file
    with open(file_name, 'wb') as f:
        if not r.ok:
            print 'error in %s' % (url)
        for block in r.iter_content(1024):
            if not block:
                break
            f.write(block)  # can I parse it without writing to file?
            f.flush()

    # decompress gz file
    subprocess.call(['gunzip', '-f', file_name])

    # parse xml file
    page = lxml.html.parse(file_name[0:-3])
    all_urls = page.xpath('//url/loc/text()')
    # print all_urls

    # delete sitemap file now
    subprocess.call(['rm', '-rf', file_name[0:-3]])
    return all_urls

In this code I am writing the compressed sitemap to a file; my intention is not to write anything to a file.
For learning purposes, and to create a smarter version of the above code, how can I parse it by decompressing the gzip stream on the fly, so that I won't need to download the file to disk or write it to a file?

Alok
  • What are the reasons you need to parse it as a stream? – m.wasowski Oct 25 '14 at 16:05
  • You can't parse it without at least downloading *parts* of it, that should be obvious. So what exactly are you asking - whether you can avoid *saving it to a temporary file*, or whether you can *decompress gzip stream chunk by chunk*? – Lukas Graf Oct 25 '14 at 16:07
  • I have updated my question to answer you – Alok Oct 25 '14 at 16:27

2 Answers

9

If the only requirement is not to write to disk, the gzipped file doesn't have any extensions that only the gunzip utility supports, and it fits into memory, then you can start with:

import requests
import gzip
from StringIO import StringIO

r = requests.get('http://example.com/sitemap.xml.gz')
sitemap = gzip.GzipFile(fileobj=StringIO(r.content)).read()

Then parse `sitemap` with lxml as you already do...

Note that this doesn't "chunk" the download - you might as well just get the whole file in a single request anyway.
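
For example, the parsing step could look something like this (a minimal sketch, assuming the sitemap declares the standard http://www.sitemaps.org/schemas/sitemap/0.9 namespace):

from lxml import etree

# `sitemap` is the decompressed XML string from above
tree = etree.fromstring(sitemap)

# sitemap files declare a default namespace, so bind it to a prefix for XPath
ns = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
for loc in tree.xpath('//sm:url/sm:loc/text()', namespaces=ns):
    print loc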

Jon Clements
  • @AlokSinghMahor could you describe "not working" in more detail please? – Jon Clements Oct 25 '14 at 17:06
  • @AlokSinghMahor ahhh... sorry, was experimenting with Lukas' answer... my bad... I originally had `r.content` but... appears I copied some bad code in on an update :) – Jon Clements Oct 25 '14 at 17:09
4

You can avoid writing any data to a file by working with StringIO objects - they just contain data in memory, but behave like files by implementing the protocol of a file-like object.
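
For instance, a tiny sketch of that file-like behaviour (just an illustration, not part of the solution):

from StringIO import StringIO

buf = StringIO()
buf.write('<urlset>...</urlset>')  # write to it like a file
buf.seek(0)                        # rewind, just like a real file
print buf.read()                   # read it back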

In order to uncompress streaming gzip data, you can't directly use the Python `gzip` module. For one, it tries to seek on the file object early on, and it will also fail when trying to verify the CRC32 checksum.

But you can work around that by simply using zlib directly and decompressing chunks as they arrive. The code I use for streaming zlib decompression is based on a post by Shashank.

from functools import partial
from lxml import etree
from StringIO import StringIO
import requests
import zlib


READ_BLOCK_SIZE = 1024 * 8


def decompress_stream(fileobj):
    result = StringIO()

    # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer
    d = zlib.decompressobj(16 + zlib.MAX_WBITS)
    # read the raw stream block by block until the empty-string sentinel (EOF)
    for chunk in iter(partial(fileobj.read, READ_BLOCK_SIZE), ''):
        result.write(d.decompress(chunk))

    result.seek(0)
    return result


url = 'http://example.org/sitemap.xml.gz'
response = requests.get(url, stream=True)

sitemap_xml = decompress_stream(response.raw)
tree = etree.parse(sitemap_xml)

# Get default XML namespace
ns = tree.getroot().nsmap[None]

urls = tree.xpath('/s:urlset/s:url/s:loc/text()', namespaces={'s': ns})
for url in urls:
    print url

Note though that in order to avoid saving to a local file on disk, you don't have to read a streaming response or use streaming zlib decompression at all. All you need to do is write response.content to a StringIO instead of to a file.
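
In other words, a non-streaming sketch of that simpler variant (essentially Jon Clements' answer, assuming the whole compressed file fits into memory):

import gzip
import requests
from StringIO import StringIO

response = requests.get('http://example.org/sitemap.xml.gz')
# decompress the already-downloaded body entirely in memory
xml_bytes = gzip.GzipFile(fileobj=StringIO(response.content)).read()
sitemap_xml = StringIO(xml_bytes)  # can be passed to etree.parse() as above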

Lukas Graf
  • I really like this... but since the entire document needs to be present for the `etree.parse` to work (unless one wangles some on the fly iterparse with decompressed chunks)... I think it's a bit overkill... That or I've just gone for a really naive approach (anyway +1 for an extensive answer) – Jon Clements Oct 25 '14 at 16:57
  • Nope, you didn't go for the naive approach, I fell for the red herring ;-) But because I had some familiarity with `zlib` decompression from an [earlier answer](http://stackoverflow.com/questions/12147484/extract-zlib-compressed-data-from-binary-file-in-python) I thought I'd take a stab at it. – Lukas Graf Oct 25 '14 at 17:00
  • @AlokSinghMahor you're welcome. But as stated by Jon Clements, if all you want is to avoid writing to disk, my answer is actually overkill, and doesn't give you any benefit, so I'd suggest you go with his answer. – Lukas Graf Oct 25 '14 at 17:02
  • I'll just note for some future reference, that the whole `while True` can be replaced with something like: `for chunk in iter(lambda: response.raw.read(READ_BLOCK_SIZE), ''): result.write(d.decompress(chunk))` – Jon Clements Oct 25 '14 at 17:14
  • @JonClements good point, that's basically [Hettinger's iterator-with-sentinel idiom](http://stackoverflow.com/a/25611913/1599111) ;-) Updated. – Lukas Graf Oct 25 '14 at 17:26
  • @Lukas except I'd just write his example as `for ch in chain.from_iterable(fileobj): ...` these days :) – Jon Clements Oct 25 '14 at 17:42