Suppose I have an XML document like the following. Note that the xml:id attribute values are strings that start with digits:
<node1>
<text xml:id='7865ft6zh67'>
<div chapter='0'>
<div id='theNode'>
<p xml:id="40">
A House that has:
<p xml:id="45">- a window;</p>
<p xml:id="46">- a door</p>
<p xml:id="46">- a door</p>
its a beuatiful house
</p>
</div>
</div>
</text>
</node1>
I would like to locate the text node and get all the text from the first p tag that appears inside it.
A first approach is to use the answers to my own earlier question: "lxml xpath expression for selecting all text under a given child node including his children".
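For reference, this is roughly the extraction I have in mind, as a minimal sketch on a reduced copy of the sample. It assumes the xml:id values have already been made valid (the `a` prefix is a hypothetical placeholder), and the variable names are only illustrative:

```python
from lxml import etree

# Reduced copy of the sample; the xml:id values carry a hypothetical "a" prefix
# so that they are valid NCNames and the document parses without complaint.
sample = b"""<node1>
<text xml:id='a7865ft6zh67'>
<div chapter='0'>
<div id='theNode'>
<p xml:id="a40">
A House that has:
<p xml:id="a45">- a window;</p>
<p xml:id="a46">- a door</p>
its a beautiful house
</p>
</div>
</div>
</text>
</node1>"""

root = etree.fromstring(sample)

# Locate the <text> node, then the first <p> below it in document order.
text_node = root.find(".//text")
first_p = text_node.find(".//p")

# itertext() yields the element's own text plus the text and tails of all
# descendants; split/join collapses the newlines into single spaces.
all_text = " ".join("".join(first_p.itertext()).split())
print(all_text)
```

The whitespace normalisation is a choice made for this sketch; dropping the `split`/`join` step keeps the original line breaks.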
But in this new XML (compared to the one in that question), the xml:id values start with a digit and, as pointed out in one of those answers, the following error occurs when running the code:
xml:id : attribute value 7865ft6zh67 is not an NCName, line 3, column 31
How can I still parse the XML despite these non-compliant xml:id values?
So far, the only solution I can think of is to treat the XML as a string and add a letter at the beginning of each of those xml:id values, like this:
newXML = '...hange><change xml:id="6f58f74883d55b...'
newXML_repared = newXML.replace('xml:id="','xml:id="XXid')
newXML_repared
from lxml import etree
XML_tree = etree.fromstring(newXML_repared,parser=parser)
But when doing so, I get:
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
Any suggestions?
Note: I noticed that the string itself starts with:
<?xml version="1.0" encoding="UTF-8"?>
<teiCorpus subtype="simple" ...etc
The lxml parsing documentation (https://lxml.de/parsing.html) says: "This requires, however, that unicode strings do not specify a conflicting encoding themselves and thus lie about their real encoding."
But I still don't know how to solve the problem.
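Reading that passage, one thing that might work is to hand lxml bytes instead of a str, so that the encoding declaration no longer conflicts with the input type. The sketch below is only an assumption-laden illustration: `raw_xml` is a reduced, hypothetical stand-in for my real document, and the regular expression is my own addition so that both single- and double-quoted xml:id values get the prefix (a plain `replace` on `xml:id="` would miss the single-quoted ones):

```python
import re
from lxml import etree

# Reduced, hypothetical stand-in for the real document (which is much larger).
raw_xml = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<node1><text xml:id='7865ft6zh67'>"
    '<p xml:id="40">A House</p>'
    "</text></node1>"
)

# Match xml:id=" or xml:id=' only when the value starts with a digit.
xml_id_starting_with_digit = re.compile(r"""(xml:id\s*=\s*["'])(?=\d)""")

# Insert the prefix right after the opening quote.
repaired = xml_id_starting_with_digit.sub(r"\g<1>XXid", raw_xml)

# Encode to bytes: lxml rejects str input that carries an encoding declaration,
# but accepts bytes whose encoding matches the declaration (UTF-8 here).
root = etree.fromstring(repaired.encode("utf-8"))

XML_NS = "http://www.w3.org/XML/1998/namespace"
text_id = root.find("text").get(f"{{{XML_NS}}}id")
print(text_id)
```

This changes the xml:id values, so anything that refers to them (for example, pointers elsewhere in the corpus) would need the same prefix.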
Thanks.