  • I have downloaded and extracted the full Wikipedia XML dump (60+ GB, a single XML file): 'enwiki-20170820-pages-articles-multistream.xml.bz2'

  • I am interested in the title and text from every page.

  • I need to be able to look up specific strings in each text, for selected titles only.

Questions:

1) How do I effectively clean the XML? I would like to remove everything irrelevant and keep only the title and text fields.

An example of a page could be:

<page>
<title>Afrika</title>
<ns>0</ns>
<id>2</id>
<revision>
  <id>1428708</id>
  <parentid>1391377</parentid>
  <timestamp>2016-03-06T14:00:12Z</timestamp>
  <contributor>
    <username>SpesBona</username>
    <id>2720</id>
  </contributor>
  <comment>Uitgebrei</comment>
  <model>wikitext</model>
  <format>text/x-wiki</format>
  <text xml:space="preserve">
     '''Afrika''' is die wêreld se tweede grootste [[kontinent]] in sowel 
     oppervlakte as bevolking. Saam met die eilande beslaan dit ongeveer 
     30,221,532km² wat 20,3% van die totale landoppervlakte van die [[aarde]] 
     is en dit word bewoon deur meer as 1 miljard mense - ongeveer 'n sewende 
     van die wêreldbevolking. 
  </text>
</revision>
</page>

Preferably, the only information I would keep is:

<page>
   <title>Afrika</title>
   <text xml:space="preserve">
     '''Afrika''' is die wêreld se tweede grootste [[kontinent]] in sowel 
     oppervlakte as bevolking. Saam met die eilande beslaan dit ongeveer 
     30,221,532km² wat 20,3% van die totale landoppervlakte van die [[aarde]] 
     is en dit word bewoon deur meer as 1 miljard mense - ongeveer 'n sewende 
     van die wêreldbevolking. 
    </text>
 </page>

However, I have never used XML or done any XML parsing before, so I am a bit lost as to how to do this with such a large file.

I have tried using regular expressions, but I would like to know whether there is a way to do this in Python using any of its XML handling modules.
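For illustration, is something like the following minimal streaming sketch with the standard library's `xml.etree.ElementTree.iterparse` the right direction? (The namespace URI is my guess from the dump's schema declaration, and the lookup at the end is just an example.)

```python
import bz2
import xml.etree.ElementTree as ET

# Assumed schema namespace for 2017-era dumps; check your dump's root element.
NS = "{http://www.mediawiki.org/xml/export-0.10/}"

# iterparse streams the file and holds only one <page> in memory at a time,
# so the 60+ GB dump never has to fit in RAM.
with bz2.open("enwiki-20170820-pages-articles-multistream.xml.bz2") as f:
    for event, elem in ET.iterparse(f, events=("end",)):
        if elem.tag == NS + "page":
            title = elem.findtext(NS + "title")
            text = elem.findtext(NS + "revision/" + NS + "text") or ""
            if title == "Afrika" and "kontinent" in text:  # example lookup
                print(title)
            elem.clear()  # discard the finished page to keep memory flat
```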

2) What would be an optimal data structure for searching through such a massive amount of text? Would it be advisable to create an entirely new file with the cleaned data, or to use a database like MongoDB for look-ups?

YoungChul
    Regex is the wrong tool for parsing XML. Use XPath to navigate the parts of the XML (and then possibly regex once you get to the targeted text). If what you really want is to produce another XML file based upon your source XML file, use XSLT. What to use to store the text is a design question that to answer would require you to state more of your constraints and goals. Even with such elaborations, however, your question would still be **too broad** for this site. – kjhughes Oct 08 '17 at 20:07
  • Possible duplicate of [Wikipedia text download](https://stackoverflow.com/questions/2683506/wikipedia-text-download) – sophros Dec 10 '18 at 15:44

2 Answers


Use WikiExtractor to convert the archive into a single text file: https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/WikiExtractor.py

Usage:

python3 WikiExtractor.py --infn dump.xml.bz2

For more information: http://wiki.apertium.org/wiki/Wikipedia_Extractor

Alternatively, you can download old Wikipedia archives as plain text from:

http://kopiwiki.dsd.sztaki.hu/

Ashish Jain

If you have any experience in Python, you should use the BeautifulSoup library with the lxml parser to parse the XML. It lets you browse through tags easily and intuitively: http://www2.hawaii.edu/~takebaya/cent110/xml_parse/xml_parse.html
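For example, a minimal sketch of that (note that BeautifulSoup builds the whole tree in memory, so feed it one `<page>` chunk at a time rather than the entire 60 GB dump):

```python
from bs4 import BeautifulSoup

# A single <page> chunk, e.g. one of the per-page files suggested below.
sample = """<page>
  <title>Afrika</title>
  <revision>
    <text xml:space="preserve">'''Afrika''' is die wêreld se tweede grootste [[kontinent]] ...</text>
  </revision>
</page>"""

page = BeautifulSoup(sample, "lxml-xml")    # lxml's XML parser
print(page.find("title").get_text())        # -> Afrika
print(page.find("text").get_text()[:50])    # first 50 chars of the wikitext
```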

To deal with the large data size, you could split each page into a separate file, then find the files with glob and parse one file at a time (see the Stack Overflow question "Find all files in a directory with extension .txt in Python"); a sketch of that loop follows.
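A minimal sketch, assuming a hypothetical pages/ directory with one `<page>` element per .xml file:

```python
import glob
from bs4 import BeautifulSoup

# Hypothetical layout: one <page> element per file under pages/.
for path in glob.glob("pages/*.xml"):
    with open(path, encoding="utf-8") as f:
        page = BeautifulSoup(f.read(), "lxml-xml")
    title = page.find("title").get_text()
    text = page.find("text").get_text()
    # ... search `text` for your strings for the selected titles here ...
```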

For the final data structure, MongoDB sounds like a good fit. If you want to do full-text search, remember to build the text indexes: https://docs.mongodb.com/manual/core/index-text/
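A minimal sketch with pymongo, assuming a local MongoDB instance (the database and collection names here are made up):

```python
from pymongo import MongoClient, TEXT

# Hypothetical database/collection names on a local MongoDB instance.
pages = MongoClient("mongodb://localhost:27017")["wiki"]["pages"]

pages.insert_one({"title": "Afrika", "text": "'''Afrika''' is die wêreld se ..."})
pages.create_index([("text", TEXT)])  # a collection can have one text index

# Full-text lookup over the indexed field:
for doc in pages.find({"$text": {"$search": "kontinent"}}):
    print(doc["title"])
```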

  • There seems to be a better way to do it - see [this question](https://stackoverflow.com/questions/2683506/wikipedia-text-download) – sophros Dec 10 '18 at 15:44