xpath
inside
empty

Question

I have started working with XPath in Python 3 and ran into this behaviour, which seems very wrong to me. Why does //h3//text() match the text of a span inside h3, but not the text of a p inside h3?

>>> from lxml import etree

>>> result = "<h3><p>Hallo</p></h3>"
>>> tree = etree.HTML(result)
>>> r = tree.xpath('//h3//text()')
>>> print(r)
[]

>>> result = "<h3><span>Hallo</span></h3>"
>>> tree = etree.HTML(result)
>>> r = tree.xpath('//h3//text()')
>>> print(r)
['Hallo']

Thanks a lot!

har07 · Accepted Answer · 2018-01-13T22:06:05.017

Your first XPath correctly returned no result because, in the parsed tree, <h3> doesn't contain any text node. You can use etree.tostring() to see the actual content of the tree:

>>> result = "<h3><p>Hallo</p></h3>"
>>> tree = etree.HTML(result)
>>> etree.tostring(tree)
b'<html><body><h3/><p>Hallo</p></body></html>'

The parser probably did this (closed h3 as an empty element and moved p out next to it) because it does not consider a paragraph inside a heading valid, whereas a span inside a heading is valid. See: Is it valid to have paragraph elements inside of a heading tag in HTML5 (P inside H1)?
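To check which nested elements the parser leaves in place, you can list the direct children of h3 after parsing. This is only a small diagnostic sketch: h3_child_tags is a hypothetical helper name, and what it prints for p may depend on the libxml2 version that lxml was built against.

```python
from lxml import etree

def h3_child_tags(markup):
    """Return the tag names of the direct children of every <h3> after HTML parsing."""
    tree = etree.HTML(markup)
    return [el.tag for el in tree.xpath('//h3/*')]

# Compare how the HTML parser treats different elements nested inside h3.
for inner in ("p", "span", "div"):
    markup = "<h3><{0}>Hallo</{0}></h3>".format(inner)
    print(inner, h3_child_tags(markup))
```

An empty list for a given element means the parser moved it out of h3, so //h3//text() has nothing to match there.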

To keep p elements inside h3, you can try a different parser, for example BeautifulSoup's parser:

>>> from lxml.html import soupparser
>>> result = "<h3><p>Hallo</p></h3>"
>>> tree = soupparser.fromstring(result)
>>> etree.tostring(tree)
b'<html><h3><p>Hallo</p></h3></html>'
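Another option, if the markup happens to be close to well-formed XML, is lxml's XML parser, which does not apply HTML nesting rules and so keeps p inside h3 as written. The following is a minimal sketch under that assumption, not a general scraping solution; h3_texts_as_xml is a hypothetical helper name. Real pages are often not well-formed XML (unclosed tags such as br, HTML-only entities), so results on scraped pages may differ even with recover=True.

```python
from lxml import etree

def h3_texts_as_xml(markup):
    """Parse markup with the XML parser and return all text nodes below h3."""
    # recover=True asks the parser to keep going past some well-formedness errors.
    parser = etree.XMLParser(recover=True)
    root = etree.fromstring(markup, parser)
    if root is None:
        # Nothing recoverable was parsed.
        return []
    return root.xpath('//h3//text()')

print(h3_texts_as_xml("<h3><p>Hallo</p></h3>"))  # ['Hallo']
```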

Is there any way to work around this problem? I'm scraping websites, and many of them use this syntax. Can I somehow change the way the HTML is read? Besides, I'm able to put div elements inside h3, although div is also not a "Phrasing content" element. - Florian, Jan 13 '18 at 12:22
