1

Attempting to add a root tag to the beginning and end of a 2mil line XML file so the file can be properly processed with my Python code.

I tried using this code from a previous post, but I am getting an error "XMLSyntaxError: Extra content at the end of the document, line __, column 1"

How do I solve this? Or is there a better way to add a root tag to the beginning and end of my large XML doc?

import lxml.etree as ET
tree = ET.parse('test.xml')
root = tree.getroot()
newroot = ET.Element("root")
newroot.insert(0, root)
print(ET.tostring(newroot, pretty_print=True))

My test XML

<pub>
    <ID>75</ID>
    <title>Use of Lexicon Density in Evaluating Word Recognizers</title>
    <year>2000</year>
    <booktitle>Multiple Classifier Systems</booktitle>
    <pages>310-319</pages>
    <authors>
        <author>Petr Slav&iacute;k</author>
        <author>Venu Govindaraju</author>
    </authors>
</pub>
<pub>
    <ID>120</ID>
    <title>Virtual endoscopy with force feedback - a new system for neurosurgical training</title>
    <year>2003</year>
    <booktitle>CARS</booktitle>
    <pages>782-787</pages>
    <authors>
        <author>Christos Trantakis</author>
        <author>Friedrich Bootz</author>
        <author>Gero Strau&szlig;</author>
        <author>Edgar Nowatius</author>
        <author>Dirk Lindner</author>
        <author>H&uuml;seyin Kem&acirc;l &Ccedil;akmak</author>
        <author>Heiko Maa&szlig;</author>
        <author>Uwe G. K&uuml;hnapfel</author>
        <author>J&uuml;rgen Meixensberger</author>
    </authors>
</pub>
Community
  • 1
  • 1
  • 1
    Your test.xml document has no root element so it is not really XML and cannot be parsed as such. – mzjn Apr 24 '17 at 21:30
  • @mzjn you missed the point, I was attempting to add the root tag so it could be read as XML. – douglasrcjames_old Apr 25 '17 at 03:03
  • Well, my point was that you tried to parse test.xml as XML before adding the root element. That's why you got the error. – mzjn Apr 25 '17 at 06:45

1 Answers1

1

I suspect that that gambit works because there is only one A element at the highest level. Fortunately, even with two million lines it's easy to add the lines you need.

In doing this I noticed that the lxml parser seems unable to process the accented characters. I have there added code to anglicise them.

import re

def anglicise(matchobj): return matchobj.group(0)[1]

outputFilename = 'result.xml'

with open('test.xml') as inXML, open(outputFilename, 'w') as outXML:
    outXML.write('<root>\n')
    for line in inXML.readlines():
        outXML.write(re.sub('&[a-zA-Z]+;',anglicise,line))
    outXML.write('</root>\n')

from lxml import etree

tree = etree.parse(outputFilename)
years = tree.xpath('.//year')
print (years[0].text)

Edit: Replace anglicise to this version to avoid replacing &amp;.

def anglicise(matchobj): 
    if matchobj.group(0) == '&amp;':
        return matchobj.group(0)
    else:
        return matchobj.group(0)[1]
Bill Bell
  • 21,021
  • 5
  • 43
  • 58
  • Awesome! I am getting a file output with everything, however, I have some entries in the bigger XML file with an `&` for '&' , and the code is converting them, but to an 'a' character. I don't quite understand the `outXML.write(re.sub('&[a-z]+;',anglicise,line))` part of your code, how do I adjust that to process the &? – douglasrcjames_old Apr 25 '17 at 03:02
  • Sorry one last thing, my code is now getting stuck up on the second, but not the first author with the `ß` character in their name. I attached the xml node in my (edited) question. Seems odd to me that it wouldn't stop at the first, but the second throw an error – douglasrcjames_old Apr 25 '17 at 03:55
  • I've changed the regex. – Bill Bell Apr 25 '17 at 12:13