At the MongoNYC 2013 conference, a speaker mentioned they had used copies of Wikipedia to test their full-text search. I've tried to replicate this myself, but have found it nontrivial due to the file size and format.
Here's what I'm doing:
$ wget http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
$ bunzip2 enwiki-latest-pages-articles.xml.bz2
$ python
>>> import xml.etree.ElementTree as ET
>>> tree = ET.parse('enwiki-latest-pages-articles.xml')
Killed
Python gets killed when I try to parse the XML file with the standard parser, presumably because it runs out of memory building the whole tree. Does anyone have any other suggestions for how to convert a 9 GB XML file into something JSON-y I can load into MongoDB?
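For context, the shape of what I'm ultimately after is roughly this; extract_pages is a hypothetical generator I don't have yet, and the database/collection names are just placeholders:
from pymongo import MongoClient

client = MongoClient()                    # local mongod
articles = client.wikipedia.articles      # placeholder database and collection names

# extract_pages is the missing piece: it needs to stream one page at a time
# as a dict like {'title': ..., 'text': ...} without loading the whole XML.
for page in extract_pages('enwiki-latest-pages-articles.xml'):
    articles.insert(page)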
UPDATE 1
Following up on Sean's suggestion below, I tried iterparse from ElementTree as well:
>>> import xml.etree.ElementTree as ET
>>> context = ET.iterparse('enwiki-latest-pages-articles.xml', events=("start", "end"))
>>> context = iter(context)
>>> event, root = context.next()
>>> for i in context[0:10]:
...     print(i)
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '_IterParseIterator' object has no attribute '__getitem__'
>>> for event, elem in context[0:10]:
...     if event == "end" and elem.tag == "record":
...         print(elem)
...     root.clear()
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '_IterParseIterator' object has no attribute '__getitem__'
Similarly, no luck.
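In hindsight I suspect the slicing itself is also wrong, since iterparse returns an iterator that can't be indexed like a list; something like itertools.islice would be needed to peek at the first few events. A minimal sketch of what I mean (untested on the full dump):
import itertools
import xml.etree.ElementTree as ET

context = ET.iterparse('enwiki-latest-pages-articles.xml', events=("start", "end"))
event, root = next(context)              # the first event hands back the root element

# islice takes the first 10 events without trying to index the iterator
for event, elem in itertools.islice(context, 10):
    print(elem.tag)
    root.clear()                          # drop already-seen children to keep memory flat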
UPDATE 2
Following up on Asya Kamsky's suggestion below.
Here's my attempt with xml2json:
$ git clone https://github.com/hay/xml2json.git
$ ./xml2json/xml2json.py -t xml2json -o enwiki-latest-pages-articles.json enwiki-latest-pages-articles.xml
Traceback (most recent call last):
  File "./xml2json/xml2json.py", line 199, in <module>
    main()
  File "./xml2json/xml2json.py", line 181, in main
    input = open(arguments[0]).read()
MemoryError
Here's xmlutils:
$ pip install xmlutils
$ xml2json --input "enwiki-latest-pages-articles.xml" --output "enwiki-latest-pages-articles.json"
xml2sql by Kailash Nadh (http://nadh.in)
--help for help
Wrote to enwiki-latest-pages-articles.json
But the output contains only a single record; it didn't iterate through the file.
xmltodict also looked promising, since it advertises iterative Expat parsing and being good for Wikipedia. But it too ran out of memory after 20 minutes or so:
>>> import xmltodict
>>> f = open('enwiki-latest-pages-articles.xml')
>>> doc = xmltodict.parse(f)
Killed
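Re-reading its docs, xmltodict does seem to have a streaming mode via item_depth and item_callback that I haven't tried yet. A sketch of what I think it would look like, assuming depth 2 corresponds to the <page> elements under the dump's root:
import xmltodict

def handle_page(path, page):
    # page is a dict built from a single <page> element
    print(page.get('title'))
    return True                           # True tells xmltodict to keep streaming

with open('enwiki-latest-pages-articles.xml', 'rb') as f:
    xmltodict.parse(f, item_depth=2, item_callback=handle_page)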
UPDATE 3
This is in response to Ross's answer below, modeling my parser on the article he links to:
from lxml import etree

file = 'enwiki-latest-pages-articles.xml'

def page_handler(page):
    try:
        print page.get('title', '').encode('utf-8')
    except:
        print page
        print "error"

class page_handler(object):
    def __init__(self):
        self.text = []

    def start(self, tag, attrib):
        self.is_title = True if tag == 'title' else False

    def end(self, tag):
        pass

    def data(self, data):
        if self.is_title:
            self.text.append(data.encode('utf-8'))

    def close(self):
        return self.text

def fast_iter(context, func):
    for event, elem in context:
        print(elem)
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context

process_element = etree.XMLParser(target=page_handler())
context = etree.iterparse(file, tag='item')
fast_iter(context, process_element)
The error is:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 2, in fast_iter
  File "iterparse.pxi", line 484, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:112653)
  File "iterparse.pxi", line 537, in lxml.etree.iterparse._read_more_events (src/lxml/lxml.etree.c:113223)
  File "parser.pxi", line 596, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:83186)
lxml.etree.XMLSyntaxError: Extra content at the end of the document, line 22, column 1
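One thing I plan to try next: the dump's elements live in the MediaWiki export namespace, so tag='item' probably never matches anything. Here's a sketch using the namespaced <page> tag; the 0.8 in the namespace URI is a guess and should be checked against whatever the dump's root element actually declares:
from lxml import etree

# Namespace declared on the dump's root element; the version number may differ.
NS = '{http://www.mediawiki.org/xml/export-0.8/}'

def print_title(page):
    title = page.find(NS + 'title')
    if title is not None:
        print(title.text)

def fast_iter(context, func):
    for event, elem in context:
        func(elem)
        elem.clear()                           # free the element we just processed
        while elem.getprevious() is not None:  # and any earlier siblings still attached
            del elem.getparent()[0]
    del context

context = etree.iterparse('enwiki-latest-pages-articles.xml',
                          events=('end',), tag=NS + 'page')
fast_iter(context, print_title)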