
I have a Wikipedia dump that is ~75 GB uncompressed (~16 GB compressed). I have tried using something along the lines of

from xml.etree.ElementTree import iterparse

for event, elem in iterparse('enwiki-latest-pages-articles-multistream.xml'):
    if elem.tag == "___":
        #do something
        elem.clear()

The kernel ends up dying in Jupyter Notebook after a little while. The thing is, I don't want all of the data in this dump (supposedly ~1000M lines) -- I only want to filter it for a few entities. But to do this, I would have to read it in first, right? This is seemingly what's causing the kernel to die. I just want a very small subset of the dump, and wanted to see if there is a way to accomplish this filtering of such a large XML file in Jupyter.

formicaman

1 Answer


But to do this, I would have to read it in first right?

Actually, no.

Generally speaking, there are two ways to process XML data. One approach does "read it all into memory," creating an in-memory data structure all at once. But the other approach, generically called SAX, reads through the XML file and calls "handlers" in your code at specified points. The file can be arbitrarily large.
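
For example, with Python's built-in xml.sax module you register a handler and the parser calls it element by element, so only the page currently being read is ever held in memory. This is only a rough sketch assuming the usual MediaWiki export layout (<page> elements containing a <title>); the filename and the set of wanted titles are placeholders for your own values:

import xml.sax

class PageFilter(xml.sax.ContentHandler):
    def __init__(self, wanted_titles):
        super().__init__()
        self.wanted = wanted_titles
        self.current_tag = None   # tag whose text is currently being read
        self.title_parts = []     # characters() can fire more than once per element
        self.matches = []

    def startElement(self, name, attrs):
        self.current_tag = name
        if name == "page":
            self.title_parts = []

    def characters(self, content):
        if self.current_tag == "title":
            self.title_parts.append(content)

    def endElement(self, name):
        if name == "page":
            title = "".join(self.title_parts)
            if title in self.wanted:
                self.matches.append(title)   # or process the page here
        self.current_tag = None

handler = PageFilter({"Python (programming language)", "XML"})
xml.sax.parse("enwiki-latest-pages-articles-multistream.xml", handler)
print(handler.matches)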

There is also another technology called "XPath expressions." This lets you construct a string that tells the XPath engine what nodes you want to find. XPath then returns a list of corresponding nodes to you. You don't have to "write a program"(!) to get the results you need, as long as XPath can do the work for you. (I recommend using libxml2, which is an industry-standard binary engine for doing this sort of thing. See How to use Xpath in Python?)
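
If you would rather keep something close to your iterparse loop, lxml is the usual Python binding for libxml2 and supports both streaming and XPath. A rough sketch, again assuming the standard dump layout; the filename, the title set, and the {*} namespace wildcard are assumptions you may need to adapt to your export version:

from lxml import etree

wanted = {"Python (programming language)", "XML"}

context = etree.iterparse(
    "enwiki-latest-pages-articles-multistream.xml",
    events=("end",),
    tag="{*}page",    # only fire once a complete <page> element has been read
)

for event, page in context:
    title = page.findtext("{*}title")
    if title in wanted:
        text = page.findtext("{*}revision/{*}text")
        # ...do something with the matching page...
        print(title, len(text or ""))
    # detach the finished element (and earlier siblings) so memory stays flat
    page.clear()
    while page.getprevious() is not None:
        del page.getparent()[0]

The clearing at the end is the part missing from the snippet in the question: iterparse still builds a tree as it goes, so processed elements have to be detached from their parent explicitly, otherwise memory keeps growing until the kernel dies.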

Mike Robinson
  • Thanks! So what you're saying is that I could use libxml2 and basically it will only read in the lines that contain a certain string, for example, rather than the entire file into memory. – formicaman Feb 10 '20 at 20:32