
I am trying to parse an XML dump of the Wiktionary, but I am probably missing something, since I get no output at all.

This is a similar but much shorter XML file:

<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.8/ http://www.mediawiki.org/xml/export-0.8.xsd" version="0.8" xml:lang="it">    
 <page>
    <title>bigoto</title>
    <ns>0</ns>
    <id>24840</id>
    <revision>
      <id>1171207</id>
      <parentid>743817</parentid>
      <timestamp>2011-12-18T19:26:42Z</timestamp>
      <contributor>
        <username>GnuBotmarcoo</username>
        <id>14353</id>
      </contributor>
      <minor />
      <comment>[[Wikizionario:Bot|Bot]]: Sostituisco template {{[[Template:in|in]]}}</comment>
      <text xml:space="preserve">== wikimarkups ==</text>
      <sha1>gji6wqnsy6vi1ro8887t3bikh7nb3fr</sha1>
      <model>wikitext</model>
      <format>text/x-wiki</format>
    </revision>
 </page>
</mediawiki>

I am interested in parsing the content of the <title> element when the <ns> element equals 0.

This is my script:

import xml.etree.ElementTree as ET
tree = ET.parse('test.xml')
root = tree.getroot()

for page in root.findall('page'):
    ns = int(page.find('ns').text)
    word = page.find('title').text
    if ns == 0:
        print word

1 Answer


I recommend using BeautifulSoup for something like this where you can, because it's just so easy to use.

from bs4 import BeautifulSoup as BS
# given your XML document as the string 'xml_doc'
soup = BS(xml_doc, "xml")
pages = soup.find_all('page')
for page in pages:
    if page.ns.text == '0':
        print page.title.text

As far as I can tell, there is no need to use int to convert your <ns> tag to an integer to compare against == 0. Comparing against the string '0' works just as well--even more easily, in this case, since you don't have to deal with conversion at all.
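As an aside, if you do want to stay with ElementTree: the likely reason your original script printed nothing is the default namespace declared on the <mediawiki> root element, which makes plain `findall('page')` match nothing. A minimal sketch, assuming the export-0.8 namespace from your sample (the inline XML here is just a stand-in for your file):

```python
import xml.etree.ElementTree as ET

# The dump declares a default namespace, so bare tag names won't match;
# pass a prefix-to-URI mapping to find/findall instead.
NS = {'mw': 'http://www.mediawiki.org/xml/export-0.8/'}

xml_doc = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/">
  <page><title>bigoto</title><ns>0</ns></page>
  <page><title>Discussione:foo</title><ns>1</ns></page>
</mediawiki>"""

root = ET.fromstring(xml_doc)
titles = [page.find('mw:title', NS).text
          for page in root.findall('mw:page', NS)
          if page.find('mw:ns', NS).text == '0']
print(titles)  # ['bigoto']
```

With `ET.parse('test.xml')` instead of `fromstring`, the same loop works on your file.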

  • This works on the small XML file. But when I parse the long (128 MB) XML dump, the script crashes... or at least I think it crashed, since it didn't complete the job after more than two hours. Is there any strategy to get it working on very large files? – CptNemo May 14 '13 at 10:30
  • Ah, BeautifulSoup does tend to be very slow when working with large files--in that case, you might want to use `lxml`. One thing you could give a shot is running BeautifulSoup on top of the `lxml` XML parser--I've updated my response to show how you do that within the BeautifulSoup constructor, adding `"xml"` as the second argument. – jdotjdot May 14 '13 at 12:30
  • Now I get the error `TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'` with this line: `soup = BS(open("itwiktionary-20130507-pages-articles.xml"), "xml")`. Am I doing something wrong? – CptNemo May 14 '13 at 22:46
  • Yes, you are--should be `open("itwiktionary-20130507-pages-articles.xml", "r")`--but you should probably ask these as another question. – jdotjdot May 14 '13 at 22:51
  • The `TypeError: unsupported operand type(s) for...` was raised because of a known bug in BS. I solved it by upgrading. Still, the document seems way too big for BS. It keeps crashing. – CptNemo May 15 '13 at 12:30
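For dumps too large to load into memory at once, the usual approach is a streaming parse with `lxml.etree.iterparse`, which handles one `<page>` at a time and frees it afterwards. A sketch, assuming the export-0.8 namespace from the question (the `BytesIO` object here is a hypothetical stand-in for the real 128 MB dump file):

```python
from io import BytesIO
from lxml import etree

MW = 'http://www.mediawiki.org/xml/export-0.8/'

# Stand-in for open('itwiktionary-...-pages-articles.xml', 'rb')
dump = BytesIO(b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/">
  <page><title>bigoto</title><ns>0</ns></page>
  <page><title>Discussione:foo</title><ns>1</ns></page>
</mediawiki>""")

titles = []
# iterparse streams the input, firing an event as each <page> closes,
# so memory use stays flat no matter how large the dump is.
for _, page in etree.iterparse(dump, tag='{%s}page' % MW):
    if page.findtext('{%s}ns' % MW) == '0':
        titles.append(page.findtext('{%s}title' % MW))
    page.clear()  # free the element once it has been processed

print(titles)  # ['bigoto']
```

The `tag=` filter and the `page.clear()` call are what keep this workable on multi-hundred-megabyte dumps.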