I have downloaded and extracted the full Wikipedia XML dump (60+ GB, a single XML file): enwiki-20170820-pages-articles-multistream.xml.bz2.
I am interested in the title and text from every page.
I need to be able to look up specific strings in each text, for selected titles only.
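To make this concrete, the kind of lookup I want to end up with looks roughly like the snippet below (the in-memory dict is just a placeholder for whatever structure I end up using):

# Placeholder illustrating the lookup I am after: title -> wikitext, for selected titles only
pages = {"Afrika": "'''Afrika''' is die wêreld se tweede grootste [[kontinent]] ..."}

for title in ["Afrika"]:
    text = pages.get(title, "")
    print(title, "kontinent" in text)  # does this string occur in this page's text?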
Questions:
1) How do I effectively clean the XML file? I would like to remove everything except the title and text fields.
An example of a page could be:
<page>
<title>Afrika</title>
<ns>0</ns>
<id>2</id>
<revision>
<id>1428708</id>
<parentid>1391377</parentid>
<timestamp>2016-03-06T14:00:12Z</timestamp>
<contributor>
<username>SpesBona</username>
<id>2720</id>
</contributor>
<comment>Uitgebrei</comment>
<model>wikitext</model>
<format>text/x-wiki</format>
<text xml:space="preserve">
'''Afrika''' is die wêreld se tweede grootste [[kontinent]] in sowel
oppervlakte as bevolking. Saam met die eilande beslaan dit ongeveer
30,221,532km² wat 20,3% van die totale landoppervlakte van die [[aarde]]
is en dit word bewoon deur meer as 1 miljard mense - ongeveer 'n sewende
van die wêreldbevolking.
</text>
</revision>
</page>
Preferably, the only information I would keep is:
<page>
<title>Afrika</title>
<text xml:space="preserve">
'''Afrika''' is die wêreld se tweede grootste [[kontinent]] in sowel
oppervlakte as bevolking. Saam met die eilande beslaan dit ongeveer
30,221,532km² wat 20,3% van die totale landoppervlakte van die [[aarde]]
is en dit word bewoon deur meer as 1 miljard mense - ongeveer 'n sewende
van die wêreldbevolking.
</text>
</page>
However, I have never worked with XML or done any XML parsing before, so I am a bit lost as to how to do this with such a large file.
I have tried using regular expressions, but I would like to know whether there is a way to do this in Python using one of its XML handling modules?
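For what it is worth, the rough direction I was imagining with ElementTree's iterparse is sketched below; I have no idea whether this is sensible for a file this size, and the namespace URL and the extracted file name are guesses on my part:

import xml.etree.ElementTree as ET

# Untested sketch: stream over the extracted dump one <page> element at a time.
# The namespace below is what I think the dump header declares; it may be different.
NS = "{http://www.mediawiki.org/xml/export-0.10/}"

for event, elem in ET.iterparse("enwiki-20170820-pages-articles-multistream.xml", events=("end",)):
    if elem.tag == NS + "page":
        title = elem.findtext(NS + "title")
        text = elem.findtext(NS + "revision/" + NS + "text")
        # ... keep only title/text here, or run the string lookups directly ...
        elem.clear()  # drop the parsed element so memory stays bounded

Is something along these lines the right way to use the XML modules, or is there a better-suited approach?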
2) What would be a suitable data structure for searching through such a massive amount of text? Would it be advisable to create an entirely new file with the cleaned data, or to use a database like MongoDB for look-ups?
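To make the MongoDB option concrete, this is roughly what I picture (the database and collection names are made up, and it assumes a local mongod is running); I would load the cleaned title/text pairs once and then query by title:

from pymongo import MongoClient

# Sketch of "a database like MongoDB for look-ups"; all names are placeholders
client = MongoClient("localhost", 27017)
pages = client["wikidump"]["pages"]
pages.create_index("title")  # so look-ups by title are fast

def add_page(title, text):
    # store one cleaned page
    pages.insert_one({"title": title, "text": text})

def text_contains(title, needle):
    # look up one title and check whether the given string occurs in its text
    doc = pages.find_one({"title": title})
    return doc is not None and needle in doc["text"]

Would that be overkill compared to just writing the cleaned pages to a new (much smaller) file?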