I have several large xml.bz2 files that I need to parse.

I am interested in getting the content of the <text></text> elements. These XML files are malformed: they lack the <mediawiki xml:lang="en"> opening tag at the start and the </mediawiki> closing tag at the end (compare the format shown at https://en.wikipedia.org/wiki/Help:Export). When I run the code below:

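from lxml import etree

# Stream over the file, stopping at each closing </text> tag
context = etree.iterparse("pages1.xml", tag="text")

for event, elem in context:
    print(elem.text)  # content of the <text> element
    # Free the element and its already-processed siblings to limit memory use
    elem.clear()
    while elem.getprevious() is not None:
        del elem.getparent()[0]

del context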

I received the following error:

lxml.etree.XMLSyntaxError: Extra content at the end of the document, line 256, column 3

I searched and found the post "parsing large xml file with Python - etree.parse error", which suggests wrapping the entire XML in a single root tag. However, I am not sure how to add a root tag to an existing document. The structure of my XML files is shown below. I would greatly appreciate your help. Thank you.

<page>
  <title>Page title</title>
  <!-- page namespace code -->
  <ns>0</ns>
  <id>2</id>
  <!-- If page is a redirection, element "redirect" contains title of the page redirect to -->
  <redirect title="Redirect page title" />
  <restrictions>edit=sysop:move=sysop</restrictions>
  <revision>
    <timestamp>2001-01-15T13:15:00Z</timestamp>
    <contributor>
      <username>Foobar</username>
      <id>65536</id>
    </contributor>
    <comment>I have just one thing to say!</comment>
    <text>A bunch of [[text]] here.</text>
    <minor />
  </revision>
  <revision>
    <timestamp>2001-01-15T13:10:27Z</timestamp>
    <contributor><ip>10.0.0.2</ip></contributor>
    <comment>new!</comment>
    <text>An earlier [[revision]].</text>
  </revision>
  <revision>
    <!-- deleted revision example -->
    <id>4557485</id>
    <parentid>1243372</parentid>
    <timestamp>2010-06-24T02:40:22Z</timestamp>
    <contributor deleted="deleted" />
    <model>wikitext</model>
    <format>text/x-wiki</format>
    <text deleted="deleted" />
    <sha1/>
  </revision>
</page>

<page>
  <title>Talk:Page title</title>
  <revision>
    <timestamp>2001-01-15T14:03:00Z</timestamp>
    <contributor><ip>10.0.0.2</ip></contributor>
    <comment>hey</comment>
    <text>WHYD YOU LOCK PAGE??!!! i was editing that jerk</text>
  </revision>
</page>
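
For reference, here is a rough sketch of what "wrapping" could look like without rewriting the files on disk: a small file-like object that hands the parser an opening root tag, then the real stream, then the closing tag. This is only an illustration under some assumptions. The names WrappedStream and iter_text are made up for this example, the root tag is a plain <mediawiki> with no attributes, the dump is assumed to have no XML declaration at the top, and the standard-library xml.etree.ElementTree is used in place of lxml so that it runs without extra installs.

```python
import bz2
import io
import xml.etree.ElementTree as ET


class WrappedStream:
    """Hypothetical helper: presents <root> + stream + </root> as one file."""

    def __init__(self, stream, root=b"mediawiki"):
        head = io.BytesIO(b"<" + root + b">")
        tail = io.BytesIO(b"</" + root + b">")
        self._parts = iter([head, stream, tail])
        self._current = next(self._parts)

    def read(self, size=-1):
        # Serve data from the current part and move on once it is exhausted.
        # A call may return fewer bytes than requested (and never spans two
        # parts), which is fine for a parser that reads until it gets b"".
        while self._current is not None:
            data = self._current.read(size)
            if data:
                return data
            self._current = next(self._parts, None)
        return b""


def iter_text(stream):
    """Yield the content of each <text> element in a root-less dump."""
    context = ET.iterparse(WrappedStream(stream), events=("start", "end"))
    _, root = next(context)  # the synthetic <mediawiki> root element
    for event, elem in context:
        if event != "end":
            continue
        if elem.tag == "text":
            yield elem.text  # None for empty elements such as <text ... />
        elif elem.tag == "page":
            root.clear()  # drop finished pages so memory use stays flat


# Small in-memory demo: two root-less pages, bz2-compressed like the real files
sample = (
    b"<page><revision><text>A bunch of [[text]] here.</text></revision></page>\n"
    b"<page><revision><text>hey</text></revision></page>"
)
with bz2.open(io.BytesIO(bz2.compress(sample)), "rb") as stream:
    texts = list(iter_text(stream))
print(texts)
```

For a real file, the stream would presumably come from something like bz2.open("pages1.xml.bz2", "rb") (the .bz2 file name is a guess based on the description above). As far as I know, lxml's etree.iterparse also accepts file-like objects, so the same wrapper might work there, but I have not verified that against these files.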
yearntolearn