I have started working with XPath in Python 3 and am running into this behaviour, which seems very wrong to me. Why does it match the text inside a span, but not the text inside a p, when both are nested in an h3?

>>> from lxml import etree

>>> result = "<h3><p>Hallo</p></h3>"
>>> tree = etree.HTML(result)
>>> r = tree.xpath('//h3//text()')
>>> print(r)
[]

>>> result = "<h3><span>Hallo</span></h3>"
>>> tree = etree.HTML(result)
>>> r = tree.xpath('//h3//text()')
>>> print(r)
['Hallo']

Thanks a lot!

Ronan Boiteau
Florian

1 Answer

Your first XPath correctly returned no result because, in the tree the parser actually built, <h3> did not contain any text node. You can use the tostring() function to see the actual content of the tree:

>>> result = "<h3><p>Hallo</p></h3>"
>>> tree = etree.HTML(result)
>>> etree.tostring(tree)
b'<html><body><h3/><p>Hallo</p></body></html>'

The parser probably did this (closed h3 as an empty element and moved p out to become its sibling) because it considers a paragraph inside a heading tag invalid, whereas a span inside a heading is valid. See: Is it valid to have paragraph elements inside of a heading tag in HTML5 (P inside H1)?
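One way to check that it is the HTML parser, and not the XPath expression, that causes the difference is to parse the same string with lxml's XML parser, which knows nothing about HTML nesting rules. This is only a sketch: the variable names are illustrative, and the XML parser is generally not a good fit for real-world HTML, which is rarely well-formed XML.

```python
from lxml import etree

markup = "<h3><p>Hallo</p></h3>"

# The XML parser has no notion of HTML content models, so it keeps
# the <p> nested inside the <h3> exactly as written.
xml_root = etree.fromstring(markup)
xml_texts = xml_root.xpath('//h3//text()')
print(xml_texts)

# The HTML parser applies HTML nesting rules and may rearrange the tree,
# so the same XPath can return something different here.
html_root = etree.HTML(markup)
html_texts = html_root.xpath('//h3//text()')
print(etree.tostring(html_root))
print(html_texts)
```

If the two printed lists differ, the tree was restructured during HTML parsing and the XPath itself is fine.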

To keep p elements inside h3, you can try a different parser, e.g. BeautifulSoup's parser (this requires the BeautifulSoup package to be installed):

>>> from lxml.html import soupparser
>>> result = "<h3><p>Hallo</p></h3>"
>>> tree = soupparser.fromstring(result)
>>> etree.tostring(tree)
b'<html><h3><p>Hallo</p></h3></html>'
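If installing another package is not an option, a rough alternative is the standard library's html.parser, which only tokenizes the markup and does not move elements around. The sketch below is not an lxml feature and does not give you XPath; the class name HeadingTextCollector and the helper h3_texts are made up for illustration. It simply collects every text chunk that appears between an <h3> start tag and its end tag as written in the source.

```python
from html.parser import HTMLParser


class HeadingTextCollector(HTMLParser):
    """Collect text found between <h3> and </h3> as written in the source."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # number of <h3> elements currently open
        self.texts = []   # non-blank text chunks seen inside an <h3>

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "h3" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep only text that is inside an open <h3> and not pure whitespace.
        if self.depth > 0 and data.strip():
            self.texts.append(data)


def h3_texts(markup):
    """Return the text chunks nested anywhere inside <h3> elements."""
    collector = HeadingTextCollector()
    collector.feed(markup)
    collector.close()
    return collector.texts


print(h3_texts("<h3><p>Hallo</p></h3>"))        # ['Hallo']
print(h3_texts("<h3><span>Hallo</span></h3>"))  # ['Hallo']
```

Because nothing is re-nested, both inputs from the question give the same result. The trade-off is that badly broken markup (for example a missing </h3>) is not repaired either.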
har07
  • Is there any way to work around this problem? I'm scraping websites, and many of them use this syntax. Can I somehow change the way it reads the HTML? Besides, I'm able to put div elements inside h3, although div is also not a "phrasing content" element. – Florian Jan 13 '18 at 12:22
  • Thanks! That helped! – Florian Jan 15 '18 at 16:37