Suppose I have an XML document as follows:

<node1>
    <text title='book'>
       <div chapter='0'>
          <div id='theNode'>
              <p xml:id="40">
               A House that has:
                   <p xml:id="45">- a window;</p>
                   <p xml:id="46">- a door</p>
                   <p xml:id="46">- a door</p>
               its a beuatiful house
               </p>
          </div>
       </div>
    </text>
</node1>

I would like to locate the text element with title='book' and get all the text from the first p element that appears inside it.

So far I have:

from lxml import etree

# parser is not shown in the question; it is assumed to be an lxml parser defined elsewhere
XML_tree = etree.fromstring(XML_content, parser=parser)
text = XML_tree.xpath('//text[@title="book"]/div/div/p/text()')

This gets only the text directly inside that p: "A House that has: its a beuatiful house"

But I would also like the text of all the children and grandchildren of the first <p> appearing under <text title='book'>.

Basically: look for <text title='book'>, then look for the first <p>, and give me all the text under that p tag, whatever the nesting.

Pseudocode:

text = XML_tree.xpath('//text[@title="book"]/... any number of nodes.../p/ ....all text under p') 

Thanks.

JFerro

2 Answers

Try using either string() or normalize-space()...

from lxml import etree

XML_content = """
<node1>
    <text title='book'>
       <div chapter='0'>
          <div id='theNode'>
              <p xml:id="x40">
               A House that has:
                   <p xml:id="x45">- a window;</p>
                   <p xml:id="x46">- a door</p>
                   <p xml:id="x47">- a door</p>
               its a beuatiful house
               </p>
          </div>
       </div>
    </text>
</node1>
"""

XML_tree = etree.fromstring(XML_content)
text = XML_tree.xpath('string(//text[@title="book"]/div/div/p)')
# text = XML_tree.xpath('normalize-space(//text[@title="book"]/div/div/p)')
print(text)

Output using string()...


               A House that has:
                   - a window;
                   - a door
                   - a door
               its a beuatiful house

Output using normalize-space()...

A House that has: - a window; - a door - a door its a beuatiful house
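
The paths above hard-code the two div levels. If the nesting depth is not known in advance, as in the question's pseudocode, a descendant step (//) can stand in for them. A minimal sketch, assuming the same id-corrected XML as above; the names first_p, raw_text and flat_text are only illustrative:

```python
from lxml import etree

XML_content = """
<node1>
    <text title='book'>
       <div chapter='0'>
          <div id='theNode'>
              <p xml:id="x40">
               A House that has:
                   <p xml:id="x45">- a window;</p>
                   <p xml:id="x46">- a door</p>
                   <p xml:id="x47">- a door</p>
               its a beuatiful house
               </p>
          </div>
       </div>
    </text>
</node1>
"""

XML_tree = etree.fromstring(XML_content)

# //p matches p at any depth under <text title="book">;
# (...)[1] keeps only the first match in document order.
first_p = '(//text[@title="book"]//p)[1]'

raw_text = XML_tree.xpath(f'string({first_p})')            # whitespace preserved
flat_text = XML_tree.xpath(f'normalize-space({first_p})')  # whitespace collapsed
print(flat_text)
```

In XPath 1.0, string() applied to a node-set already uses the first node in document order, so the explicit [1] mainly documents the intent.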
Daniel Haley
  • Quick question, if you don't mind: I notice you inserted an `x` before the attribute value of each `xml:id`; why was that necessary? – Jack Fleeting Jun 19 '20 at 16:32
  • @JackFleeting - a value starting with a number is not a valid xml:id. I had to add the “x” and change the last id from x46 to x47 to be able to get lxml to parse the XML. – Daniel Haley Jun 19 '20 at 16:41
  • Got it; thanks! FYI, I managed to get the same results with the exact version of OP's example (including the duplicated id) using lxml.html... – Jack Fleeting Jun 19 '20 at 16:44
  • @JackFleeting yeah, lxml.html is pretty good at dealing with XML-like (not well-formed) data. It’s a lot easier when the original question has code you can use to reproduce the output without any modifications. – Daniel Haley Jun 19 '20 at 16:46
  • @DanielHaley my XML id comes like this: – JFerro Jun 21 '20 at 12:26
  • @DanielHaley you are right, I am getting an error like: xml:id : attribute value 6bb011667...etc is not an NCName. The question is then how to parse an XML like this. – JFerro Jun 21 '20 at 13:16
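
One possible way around the NCName error raised in the comments, sketched under the assumption that a lenient parse is acceptable for the data at hand: lxml's XMLParser takes a recover=True option that tries to build a tree despite errors instead of raising. The XML below is the question's original sample, including the numeric and duplicated xml:id values; lenient_parser is just an illustrative name, and whether the recovered tree is good enough for a particular file has to be checked case by case.

```python
from lxml import etree

XML_content = """
<node1>
    <text title='book'>
       <div chapter='0'>
          <div id='theNode'>
              <p xml:id="40">
               A House that has:
                   <p xml:id="45">- a window;</p>
                   <p xml:id="46">- a door</p>
                   <p xml:id="46">- a door</p>
               its a beuatiful house
               </p>
          </div>
       </div>
    </text>
</node1>
"""

# recover=True: keep parsing past errors such as invalid or duplicated xml:id values
lenient_parser = etree.XMLParser(recover=True)
XML_tree = etree.fromstring(XML_content, parser=lenient_parser)

text = XML_tree.xpath('normalize-space(//text[@title="book"]/div/div/p)')
print(text)
```

The XPath is the same as in the answer; only the parser differs.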

Another option:

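from lxml import etree

# XML_content is the same string as in the answer above
XML_tree = etree.fromstring(XML_content)
# Keep every non-blank text node under <text title="book">, stripped of surrounding whitespace
text = [el.strip() for el in XML_tree.xpath('//text()[ancestor::text[@title="book"]][normalize-space()]')]
print(" ".join(text))   # everything on one line
print("\n".join(text))  # one text node per line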

Output:

A House that has: - a window; - a door - a door its a beuatiful house
A House that has:
- a window;
- a door
- a door
its a beuatiful house
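
Note that this XPath collects every text node under <text title="book">, not only those of the first p. If only the first p is wanted, one option is to select that element and walk it with itertext(). A sketch, assuming the id-corrected XML from the first answer plus one hypothetical extra paragraph (x50, not in the original) to show that it is left out:

```python
from lxml import etree

XML_content = """
<node1>
    <text title='book'>
       <div chapter='0'>
          <div id='theNode'>
              <p xml:id="x40">
               A House that has:
                   <p xml:id="x45">- a window;</p>
                   <p xml:id="x46">- a door</p>
                   <p xml:id="x47">- a door</p>
               its a beuatiful house
               </p>
              <!-- hypothetical second paragraph, added only for this example -->
              <p xml:id="x50">another paragraph</p>
          </div>
       </div>
    </text>
</node1>
"""

XML_tree = etree.fromstring(XML_content)

# First p (in document order) at any depth under <text title="book">
first_p = XML_tree.xpath('(//text[@title="book"]//p)[1]')[0]

# itertext() yields the text and tail strings of the element and all its
# descendants; blank ones are dropped and the rest are stripped.
lines = [t.strip() for t in first_p.itertext() if t.strip()]
print("\n".join(lines))
```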
E.Wiest