
At the MongoNYC 2013 conference, a speaker mentioned they had used copies of Wikipedia to test their full-text search. I've tried to replicate this myself, but have found it nontrivial due to the file's size and format.

Here's what I'm doing:

$ wget http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
$ bunzip2 enwiki-latest-pages-articles.xml.bz2 
$ python
>>> import xml.etree.ElementTree as ET
>>> tree = ET.parse('enwiki-latest-pages-articles.xml')
Killed

Python runs out of memory and is killed when I try to parse the XML file with the standard parser; the file is simply too large. Does anyone have any other suggestions for how to convert a 9GB XML file into something JSON-y I can load into MongoDB?

UPDATE 1

Following up on Sean's suggestion below, I tried the iterative ElementTree parser (iterparse) as well:

>>> import xml.etree.ElementTree as ET
>>> context = ET.iterparse('enwiki-latest-pages-articles.xml', events=("start", "end"))
>>> context = iter(context)
>>> event, root = context.next()
>>> for i in context[0:10]:
...     print(i)
... 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '_IterParseIterator' object has no attribute '__getitem__'
>>> for event, elem in context[0:10]:
...     if event == "end" and elem.tag == "record":
...             print(elem)
...             root.clear()
... 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '_IterParseIterator' object has no attribute '__getitem__'

Similarly, no luck.
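
The iterator apparently can't be sliced at all; wrapping it in itertools.islice seems to be the way to peek at the first few events instead. A quick sketch of that (not run against the full dump):

import itertools
import xml.etree.ElementTree as ET

context = ET.iterparse('enwiki-latest-pages-articles.xml', events=("start", "end"))
context = iter(context)
event, root = next(context)                          # the first event is the start of the root element
for event, elem in itertools.islice(context, 10):    # take the first 10 events instead of slicing
    print(event, elem.tag)
    root.clear()                                     # keep the partially-built tree from growing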

UPDATE 2

Following up on Asya Kamsky's suggestion below.

Here's my attempt with xml2json:

$ git clone https://github.com/hay/xml2json.git
$ ./xml2json/xml2json.py -t xml2json -o enwiki-latest-pages-articles.json enwiki-latest-pages-articles.xml
Traceback (most recent call last):
  File "./xml2json/xml2json.py", line 199, in <module>
    main()
  File "./xml2json/xml2json.py", line 181, in main
    input = open(arguments[0]).read()
MemoryError

Here's xmlutils:

$ pip install xmlutils
$ xml2json --input "enwiki-latest-pages-articles.xml" --output "enwiki-latest-pages-articles.json"
xml2sql by Kailash Nadh (http://nadh.in)
    --help for help


Wrote to enwiki-latest-pages-articles.json

But the output contains only a single record; it didn't iterate over the pages.

xmltodict also looked promising, since it advertises iterative Expat parsing and being good for Wikipedia. But it too ran out of memory after 20 or so minutes:

>>> import xmltodict
>>> f = open('enwiki-latest-pages-articles.xml')
>>> doc = xmltodict.parse(f)
Killed
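
xmltodict does document a streaming mode (item_depth plus an item_callback) that is supposed to avoid building the whole document in memory. I haven't run this to completion, and the depth of 2 assumes the <page> elements sit directly under the root <mediawiki> element:

import xmltodict

def handle_page(path, page):
    # path is the list of (tag, attributes) pairs leading down to this element;
    # page is an OrderedDict built from a single <page> element
    print(page.get('title'))
    return True                      # returning True tells xmltodict to keep streaming

with open('enwiki-latest-pages-articles.xml') as f:
    xmltodict.parse(f, item_depth=2, item_callback=handle_page)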

UPDATE 3

This is in response to Ross' answer below, modeling my parser off the link he mentions:

from lxml import etree

file = 'enwiki-latest-pages-articles.xml'

def page_handler(page):
    try:
        print page.get('title','').encode('utf-8')
    except:
        print page
        print "error"

class page_handler(object):
    def __init__(self):
        self.text = []
    def start(self, tag, attrib):
        self.is_title = True if tag == 'title' else False
    def end(self, tag):
        pass
    def data(self, data):
        if self.is_title:
            self.text.append(data.encode('utf-8'))
    def close(self):
        return self.text

def fast_iter(context, func):
    for event, elem in context:
        print(elem)
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context

process_element = etree.XMLParser(target = page_handler())

context = etree.iterparse( file, tag='item' )
fast_iter(context,process_element)

The error is:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 2, in fast_iter
  File "iterparse.pxi", line 484, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:112653)
  File "iterparse.pxi", line 537, in lxml.etree.iterparse._read_more_events (src/lxml/lxml.etree.c:113223)
  File "parser.pxi", line 596, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:83186)
lxml.etree.XMLSyntaxError: Extra content at the end of the document, line 22, column 1
Mittenchops
  • is it something in the Python config or is it a hardware constraint? If it is a hardware constraint, a cloud services provider might be worth looking into for this task. – Nick Maroulis Jun 24 '13 at 22:31
  • http://stackoverflow.com/questions/324214/what-is-the-fastest-way-to-parse-large-xml-docs-in-python Has a solution that may be of use to you. – sean Jun 24 '13 at 22:36
  • It will get killed as long as it exceeds your available memory; I don't get why you need to convert to JSON. – PepperoniPizza Jun 24 '13 at 22:36
  • @marabutt, I am using an Amazon instance for this now, but it doesn't have 9GB for the task, and I was hoping to figure out a right answer before I attack it with bigger hardware. – Mittenchops Jun 24 '13 at 22:38
    @PepperoniPizza Only interested in converting to JSON so I can insert objects into MongoDB. If you know a way of doing that directly from XML, hopefully iterated, I'm all ears. =) – Mittenchops Jun 24 '13 at 22:39
  • there is an xmltojson.py I found on the internet - if you google for that you should be able to use it - I used it to convert some massive files... – Asya Kamsky Jun 24 '13 at 23:06

2 Answers


You would need to use iterparse to iterate rather than load the whole file into memory. As for how to convert to JSON, or even to a Python object for storing in the db, see: https://github.com/knadh/xmlutils.py/blob/master/xmlutils/xml2json.py

Update

An example of using iterparse and keeping a low memory footprint:

Try a variant of Liza Daly's fast_iter. After processing an element, elem, it calls elem.clear() to remove descendants and also removes preceding siblings.

from lxml import etree

def fast_iter(context, func):
    # http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
    # Author: Liza Daly
    for event, elem in context:
        func(elem)                       # hand each fully-parsed element to the callback
        elem.clear()                     # free the element's children
        while elem.getprevious() is not None:
            del elem.getparent()[0]      # drop references to already-processed siblings
    del context

context = etree.iterparse(MYFILE, tag='item')
fast_iter(context, process_element)

Daly's article is an excellent read, especially if you are processing large XML files.
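
As for what process_element could look like: a rough sketch below, reusing fast_iter and MYFILE from the snippet above, and not tested against the full dump. The Wikipedia dump puts every tag in a namespace, so the namespace URI here is an assumption to check against the dump's root element, and the database/collection names are made up. It pulls the title and text out of each page and inserts a plain dict straight into MongoDB with pymongo, so there is no need for an intermediate JSON file:

from lxml import etree
from pymongo import MongoClient

client = MongoClient()                # assumes a local mongod on the default port
pages = client.wiki.pages             # hypothetical database/collection names

NS = '{http://www.mediawiki.org/xml/export-0.8/}'   # check this against the dump's <mediawiki> element

def process_element(elem):
    # elem is one fully-parsed <page> element handed over by fast_iter
    doc = {
        'title': elem.findtext(NS + 'title'),
        'text': elem.findtext(NS + 'revision/' + NS + 'text'),
    }
    pages.insert_one(doc)             # assumes pymongo 3+; older versions use insert()

context = etree.iterparse(MYFILE, tag=NS + 'page')
fast_iter(context, process_element)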

Ross
  • The example in Update 1 is using `iterparse` rather than `parse`. – Mittenchops Jun 25 '13 at 14:53
  • I've updated - in Update1 you have a bug which is why it fails. Don't call `__getitem__` by doing `context[0:10]` - just iterate it. – Ross Jun 25 '13 at 15:06
  • I think this pointed me in the right direction, but could you explain what form process_element would take? I am updating my answer with what I tried and what did not work. – Mittenchops Jun 27 '13 at 00:07

Just in case someone stumbles upon this question in 2018.

Nowadays, there's a one-line command available (Node.js):

https://github.com/spencermountain/dumpster-dive

BassT