I have a huge HTML file (tens of megabytes) on a remote server, which I need to download and parse periodically to detect changes. So I'm trying to use the most common tools for this task: requests and lxml.
The common recipe for stream parsing I found looks similar to this:
import requests
from lxml import etree

# TAGS is a set of the tag names I'm interested in, defined elsewhere
def fast_iter(url):
    resp = requests.get(url, stream=True)
    context = etree.iterparse(resp.raw, html=True)
    for event, elem in context:
        print(elem)
        if event == 'end' and elem.tag in TAGS:
            yield elem
        # free memory: clear the element and drop already-processed siblings
        elem.clear()
        while elem.getprevious() is not None:
            if elem.getparent() is not None:
                del elem.getparent()[0]
            else:
                break
    del context
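I consume the generator roughly like this (the URL and the change-detection logic here are just placeholders):

for elem in fast_iter('http://example.com/huge-page.html'):
    # compare against the previously seen state to detect changes
    print(elem.tag, elem.attrib)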
But in my case it just doesn't work: iterparse() goes haywire and returns elements that are never present in the source HTML file (and the file itself is not broken):
<Element vqoe at 0x7eff9762b448>
<Element jzu at 0x7eff9762b408>
<Element vvu at 0x7eff9762b3c8>
<Element d at 0x7eff9762b388>
<Element s at 0x7eff9762b348>
<Element ss_lt at 0x7eff9762b308>
The funny thing is that when I simply save the file locally and parse it from the filesystem, everything works fine, but I really want to avoid that extra step. Is there something different about the file-like object returned by requests? And what's wrong with this approach?
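For reference, this is roughly the save-to-disk variant that does work (the file path and chunk size are arbitrary, and TAGS is the same set as above):

import requests
from lxml import etree

def fast_iter_from_disk(url, path='page.html'):
    # download the whole file to disk first...
    resp = requests.get(url, stream=True)
    with open(path, 'wb') as f:
        for chunk in resp.iter_content(chunk_size=64 * 1024):
            f.write(chunk)

    # ...then stream-parse it from the filesystem, which works as expected
    context = etree.iterparse(path, html=True)
    for event, elem in context:
        if event == 'end' and elem.tag in TAGS:
            yield elem
        elem.clear()
        while elem.getprevious() is not None:
            if elem.getparent() is not None:
                del elem.getparent()[0]
            else:
                break
    del context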
requests==2.9.1
lxml==3.5.0