
Suppose I have an XML document as follows. Notice that the xml:id attributes are strings starting with numbers:

<node1>
    <text xml:id='7865ft6zh67'>
       <div chapter='0'>
          <div id='theNode'>
              <p xml:id="40">
               A House that has:
                   <p xml:id="45">- a window;</p>
                   <p xml:id="46">- a door</p>
                   <p xml:id="46">- a door</p>
               its a beuatiful house
               </p>
          </div>
       </div>
    </text>
</node1>

I would like to locate the text node and get all the text from the first p tag appearing inside it.

A first approach can be done using the answers here: lxml xpath expression for selecting all text under a given child node including his children (my own question)

But in this new XML (compared to the one in that question) the xml:id values start with a number, and as pointed out in one of those answers, the following error occurs when using the code:

 xml:id : attribute value 7865ft6zh67 is not an NCName, line 3, column 31

How can I still parse the XML despite these non-compliant xml:id values?

So far the only solution I can think of is converting the XML to a string and adding a letter at the beginning of each of those xml:id values, like:

from lxml import etree

newXML = '...hange><change xml:id="6f58f74883d55b...'
newXML_repared = newXML.replace('xml:id="', 'xml:id="XXid')

parser = etree.XMLParser()
XML_tree = etree.fromstring(newXML_repared, parser=parser)

but when doing so I get:

 ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.

Any suggestion?

Note: I noticed that the string itself starts with:

<?xml version="1.0" encoding="UTF-8"?>
<teiCorpus subtype="simple"  ...etc

In the lxml parsing tutorial (https://lxml.de/parsing.html) it is possible to read: "This requires, however, that unicode strings do not specify a conflicting encoding themselves and thus lie about their real encoding."

But I still don't know how to solve the problem.
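
For reference, here is a small self-contained reproduction of that ValueError; the inline string is just a stand-in for the real document.

from lxml import etree

# stand-in for the real document, which also begins with an encoding declaration
doc = """<?xml version="1.0" encoding="UTF-8"?>
<node1>
    <text xml:id='7865ft6zh67'>
        <p xml:id="40">A House that has:</p>
    </text>
</node1>"""

parser = etree.XMLParser()
# raises: ValueError: Unicode strings with encoding declaration are not
# supported. Please use bytes input or XML fragments without declaration.
XML_tree = etree.fromstring(doc, parser=parser)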

Thanks.

  • Preferably not using BS because the whole rest of the team uses lxml, nobody in the team uses BS and the idea is sticking to one library. – JFerro Jun 21 '20 at 22:50
  • And apparently "BeautifulSoup, by itself, does not support XPath expressions." We need XPath because the XMLs we work with are extremely complicated and nested. But thanks for your answer. – JFerro Jun 21 '20 at 22:51
  • With `bs4` you can use CSS selectors + bs4's own api. – Andrej Kesely Jun 21 '20 at 22:54
  • Where does the bad XML come from? This should be fixed by whatever/whoever created it. – mzjn Jun 22 '20 at 07:26

1 Answer


One option is found in the link to the docs you provided (https://lxml.de/parsing.html).

Specifically, the "recover" option listed under the parser options.

Example...

from lxml import etree

XML_content = """
<node1>
    <text xml:id='7865ft6zh67' title="book">
       <div chapter='0'>
          <div id='theNode'>
              <p xml:id="40">
               A House that has:
                   <p xml:id="45">- a window;</p>
                   <p xml:id="46">- a door</p>
                   <p xml:id="46">- a door</p>
               its a beuatiful house
               </p>
          </div>
       </div>
    </text>
</node1>
"""

# recover=True tells the parser to keep going after errors such as
# the xml:id values that are not valid NCNames
parser = etree.XMLParser(recover=True)

XML_tree = etree.fromstring(XML_content, parser=parser)
# normalize-space() returns the full text of the first matching <p>
# (including the nested <p> elements) with whitespace collapsed
text = XML_tree.xpath('normalize-space(//text[@title="book"]/div/div/p)')
# string() would keep the original whitespace instead
# text = XML_tree.xpath('string(//text[@title="book"]/div/div/p)')
print(text)

Note: I added title="book" so the XPath from my other answer in your related question still worked.
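
A side note on the ValueError in the question: etree.fromstring() refuses a Python str that contains an encoding declaration, independently of the xml:id issue. A minimal sketch of one way around it (using the sample document with a declaration added as a stand-in for your real file) is to encode the string to bytes, so the declared UTF-8 encoding is honoured; recover=True then handles the invalid xml:id values without any string replacement:

from lxml import etree

raw = """<?xml version="1.0" encoding="UTF-8"?>
<node1>
    <text xml:id='7865ft6zh67' title="book">
       <div chapter='0'>
          <div id='theNode'>
              <p xml:id="40">A House that has:
                  <p xml:id="45">- a window;</p>
                  <p xml:id="46">- a door</p>
              </p>
          </div>
       </div>
    </text>
</node1>"""

parser = etree.XMLParser(recover=True)
# bytes input lets lxml apply the declared encoding itself
XML_tree = etree.fromstring(raw.encode("utf-8"), parser=parser)
print(XML_tree.xpath('normalize-space(//text[@title="book"]/div/div/p)'))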

  • lxml's `recover=True` can be useful for [cleaning up "bad" XML](https://stackoverflow.com/q/44765194/290085), but readers should be reminded that problems such as starting ids with digits violate the rules of well-formedness, and so really should be fixed at the source. Otherwise, every consumer of the "XML" has to suffer these problems, defeating the benefits of using standards. – kjhughes Jun 22 '20 at 00:46