
Suppose I have an XML document as follows. Notice that the xml:id attributes are strings starting with numbers:

<node1>
    <text xml:id='7865ft6zh67'>
       <div chapter='0'>
          <div id='theNode'>
              <p xml:id="40">
               A House that has:
                   <p xml:id="45">- a window;</p>
                   <p xml:id="46">- a door</p>
                   <p xml:id="46">- a door</p>
               its a beuatiful house
               </p>
          </div>
       </div>
    </text>
</node1>

I would like to locate the text node and get all the text from the first p tag appearing inside it.

A first approach can be done using the answers here: lxml xpath expression for selecting all text under a given child node including his children (my own question)

But in this new XML (compared to the one in that question) the xml:id values start with a number, and as pointed out in one of those answers, the following error occurs when using the code:

 xml:id : attribute value 7865ft6zh67 is not an NCName, line 3, column 31

How can I still parse the XML despite these non-compliant xml:id values?

So far the only solution I can think of is converting the XML to a string and adding a letter at the beginning of each of those xml:id values, like:

from lxml import etree

newXML = '...hange><change xml:id="6f58f74883d55b...'
newXML_repared = newXML.replace('xml:id="', 'xml:id="XXid')

parser = etree.XMLParser()
XML_tree = etree.fromstring(newXML_repared, parser=parser)

but when doing so I get:

 ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.

Any suggestion?

Note: I noticed that the string itself starts with:

<?xml version="1.0" encoding="UTF-8"?>
<teiCorpus subtype="simple"  ...etc

In the lxml parsing tutorial (https://lxml.de/parsing.html) it is possible to read: "This requires, however, that unicode strings do not specify a conflicting encoding themselves and thus lie about their real encoding."

But I still don't know how to solve the problem.
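
For reference, here is a small self-contained reproduction of that ValueError; the inline string is just a stand-in for the real document.

from lxml import etree

# stand-in for the real document, which also begins with an encoding declaration
doc = """<?xml version="1.0" encoding="UTF-8"?>
<node1>
    <text xml:id='7865ft6zh67'>
        <p xml:id="40">A House that has:</p>
    </text>
</node1>"""

parser = etree.XMLParser()
# raises: ValueError: Unicode strings with encoding declaration are not
# supported. Please use bytes input or XML fragments without declaration.
XML_tree = etree.fromstring(doc, parser=parser)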

Thanks.

  • Preferably not using BS because the whole rest of the team uses lxml, nobody in the team uses BS and the idea is sticking to one library. – JFerro Jun 21 '20 at 22:50
  • And apparently "BeautifulSoup, by itself, does not support XPath expressions." We need XPath because the XMLs we work with are extremely complicated and nested. But thanks for your answer. – JFerro Jun 21 '20 at 22:51
  • With `bs4` you can use CSS selectors + bs4's own api. – Andrej Kesely Jun 21 '20 at 22:54
  • Where does the bad XML come from? This should be fixed by whatever/whoever created it. – mzjn Jun 22 '20 at 07:26

1 Answer


One option is found in the link to the docs you provided (https://lxml.de/parsing.html).

Specifically, the "recover" option listed under the parser options.

Example...

from lxml import etree

XML_content = """
<node1>
    <text xml:id='7865ft6zh67' title="book">
       <div chapter='0'>
          <div id='theNode'>
              <p xml:id="40">
               A House that has:
                   <p xml:id="45">- a window;</p>
                   <p xml:id="46">- a door</p>
                   <p xml:id="46">- a door</p>
               its a beuatiful house
               </p>
          </div>
       </div>
    </text>
</node1>
"""

# recover=True tells the parser to keep going after errors such as
# the xml:id values that are not valid NCNames
parser = etree.XMLParser(recover=True)

XML_tree = etree.fromstring(XML_content, parser=parser)
# normalize-space() returns the full text of the first matching <p>
# (including the nested <p> elements) with whitespace collapsed
text = XML_tree.xpath('normalize-space(//text[@title="book"]/div/div/p)')
# string() would keep the original whitespace instead
# text = XML_tree.xpath('string(//text[@title="book"]/div/div/p)')
print(text)

Note: I added title="book" so the XPath from my other answer in your related question still worked.
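
A side note on the ValueError in the question: etree.fromstring() refuses a Python str that contains an encoding declaration, independently of the xml:id issue. A minimal sketch of one way around it (using the sample document with a declaration added as a stand-in for your real file) is to encode the string to bytes, so the declared UTF-8 encoding is honoured; recover=True then handles the invalid xml:id values without any string replacement:

from lxml import etree

raw = """<?xml version="1.0" encoding="UTF-8"?>
<node1>
    <text xml:id='7865ft6zh67' title="book">
       <div chapter='0'>
          <div id='theNode'>
              <p xml:id="40">A House that has:
                  <p xml:id="45">- a window;</p>
                  <p xml:id="46">- a door</p>
              </p>
          </div>
       </div>
    </text>
</node1>"""

parser = etree.XMLParser(recover=True)
# bytes input lets lxml apply the declared encoding itself
XML_tree = etree.fromstring(raw.encode("utf-8"), parser=parser)
print(XML_tree.xpath('normalize-space(//text[@title="book"]/div/div/p)'))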

  • lxml's `recover=True` can be useful for [cleaning up "bad" XML](https://stackoverflow.com/q/44765194/290085), but readers should be reminded that problems such as starting ids with digits violate the rules of well-formedness, and so really should be fixed at the source. Otherwise, every consumer of the "XML" has to suffer these problems, defeating the benefits of using standards. – kjhughes Jun 22 '20 at 00:46