I have a Wikipedia dump that is ~75 GB uncompressed (~16 GB compressed). I have tried something along the lines of
from xml.etree.ElementTree import iterparse

for event, elem in iterparse('enwiki-latest-pages-articles-multistream.xml'):
    if elem.tag == "___":
        pass  # do something with the element here
    elem.clear()
The kernel ends up dying in Jupyter Notebook after a little while. The thing is, I don't want all of the data in this dump (supposedly ~1000M lines) -- I only want to filter it for a few entities. But to do that, I would have to read the whole file in first, right? That seems to be what's killing the kernel. I only want a very small subset of it, and I'd like to know whether there is a way to accomplish this kind of filtering in Jupyter on such a large XML file.
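For context, this is roughly the kind of filtering I'm after -- a minimal sketch assuming I only want pages whose <title> is in a small set. The titles here are just placeholders, and the root.clear() trick is only my guess at keeping memory bounded; I haven't verified it on the full dump:

from xml.etree.ElementTree import iterparse

wanted = {"Anarchism", "Autism"}   # hypothetical titles I actually care about
dump = 'enwiki-latest-pages-articles-multistream.xml'

def localname(tag):
    # Strip the '{namespace}' prefix the MediaWiki export puts on every tag
    return tag.rsplit('}', 1)[-1]

context = iterparse(dump, events=('start', 'end'))
event, root = next(context)        # keep a handle on the root element

kept = []
for event, elem in context:
    if event == 'end' and localname(elem.tag) == 'page':
        title = next((c.text for c in elem if localname(c.tag) == 'title'), None)
        if title in wanted:
            # grab the raw wikitext of the page's revision
            text = next((t.text for t in elem.iter() if localname(t.tag) == 'text'), None)
            kept.append((title, text))
        # drop everything parsed so far; otherwise the tree keeps growing
        # under 'root' and memory climbs until the kernel dies
        root.clear()

print(len(kept), "pages kept")

Is this the right general shape, or is there a better way to stream-filter a dump this size?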