
I have a huge HTML file (tens of megabytes) on a server, which I need to download and parse periodically, detecting changes. So I'm trying to use the most common tools for this task: requests and lxml.

The common recipe for stream parsing I found looks similar to this:

import requests
from lxml import etree

def fast_iter(url):
    resp = requests.get(url, stream=True)
    context = etree.iterparse(resp.raw, html=True)
    for event, elem in context:
        print(elem)
        if event == 'end' and elem.tag in TAGS:
            yield elem
        elem.clear()
        # drop already-processed siblings so memory stays bounded
        while elem.getprevious() is not None:
            parent = elem.getparent()
            # note: a bare `if parent:` would be wrong here - an lxml
            # element with no children is falsy even when it exists
            if parent is not None:
                del parent[0]
            else:
                break
    del context

But in my case it just doesn't work: iterparse() goes crazy and returns elements that are never present in the source HTML file (and the file itself is not broken!):

<Element vqoe at 0x7eff9762b448>
<Element jzu at 0x7eff9762b408>
<Element vvu at 0x7eff9762b3c8>
<Element d at 0x7eff9762b388>
<Element s at 0x7eff9762b348>
<Element ss_lt at 0x7eff9762b308>

The funny thing is that when I save this file locally and parse it from the filesystem, everything works fine, but I really want to avoid that useless step. Is there something different about the file-like object returned by requests? And what's wrong with this approach?

requests==2.9.1

lxml==3.5.0

Enchantner
  • Could you share the actual `url` value you use? Thanks. – alecxe Jan 22 '16 at 17:56
  • Sorry, it's some internal stuff. But actually, it's just standard file listing returned by Apache server, it's just kinda huge. – Enchantner Jan 22 '16 at 17:57
  • see this answer http://stackoverflow.com/questions/16694907/how-to-download-large-file-in-python-with-requests-py – Joran Beasley Jan 22 '16 at 18:06
  • 1
    Guys, I do know how chunking works, and it's really useful for downloading huge files. I just don't get why it doesn't work - requests returns file-like object, and parser accepts file-like object. No matter what, they should be compatible, but they aren't. – Enchantner Jan 22 '16 at 18:09

1 Answer


You need to explicitly ask for decoding. By default, `response.raw` yields the raw bytes from the socket, still gzip/deflate-compressed if the server sent them that way, and the forgiving HTML parser turns that binary data into the garbage tags you saw:

import requests
from lxml import etree

def fast_iter(url):
    response = requests.get(url, stream=True)
    # tell urllib3 to un-gzip/inflate the body as it is read
    response.raw.decode_content = True
    # restricting events to "end" makes the explicit check redundant
    context = etree.iterparse(response.raw, html=True, events=("end",))
    for event, elem in context:
        yield elem
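An alternative that side-steps `response.raw` entirely is to feed chunks from `response.iter_content()` (which requests decompresses for you) into lxml's `HTMLPullParser`. A minimal local sketch, with a made-up chunk list standing in for the streamed response:

```python
from lxml import etree

def iter_elements(chunks, tags):
    """Stream-parse HTML chunks, yielding elements with matching tags.

    `chunks` stands in for response.iter_content(chunk_size=...),
    which - unlike response.raw - already un-gzips the body.
    """
    parser = etree.HTMLPullParser(events=("end",))
    for chunk in chunks:
        parser.feed(chunk)
        for event, elem in parser.read_events():
            if elem.tag in tags:
                yield elem
            elem.clear()
    parser.close()
    # drain any events emitted while closing the document
    for event, elem in parser.read_events():
        if elem.tag in tags:
            yield elem

html = b"<html><body><a href='f1'>f1</a><a href='f2'>f2</a></body></html>"
# simulate a streamed download by splitting into tiny chunks
chunks = [html[i:i + 7] for i in range(0, len(html), 7)]
links = [e.get("href") for e in iter_elements(chunks, {"a"})]
print(links)
```

The pull parser buffers across chunk boundaries, so it doesn't matter where the download splits a tag.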
            
Jose G