How do you access a text in an XPath if it doesn't have a node?
Text in an XML or HTML document is always associated with a node (a text node), so that's not the problem here. The " "
delimiters are just there to show you the surrounding whitespace.
As presented, your XPath should select the text within the a
element. Here are some reasons that may not be happening:
As @MadsHansen mentioned in comments, the root element of your actual HTML may not be a span as shown.
The text may not be present when your XPath executes, either because the document hasn't completely loaded or because JavaScript changes the DOM later.
fromstring() can use a bit more magic than might be expected:
fromstring(string): Returns document_fromstring or fragment_fromstring, based on whether the string looks like a full document, or just a fragment.
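To make the effect of that choice concrete, here is a minimal sketch that parses the same string two ways and compares the results. The snippet and the variable names are illustrative assumptions, not taken from the question; only `document_fromstring`, `fromstring`, and `xpath` are lxml calls.

```python
from lxml import html

# Illustrative fragment (an assumption, not the asker's real page).
snippet = '<span><a href="#">some text</a></span>'

doc = html.document_fromstring(snippet)  # always builds a full html document
auto = html.fromstring(snippet)          # decides based on what the string looks like

print(doc.tag)   # root element of the full document
print(auto.tag)  # element handed back for a fragment-like string

# An absolute path that assumes span is the root finds nothing in the full
# document, while a descendant search (//) finds the text in both cases.
print(doc.xpath('/span/a/text()'))
print(doc.xpath('//span/a/text()'))
print(auto.xpath('//span/a/text()'))
```

Starting the XPath with `//` avoids depending on which element the parser decided to hand back as the root.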
Given this, here is an update to your code that will select the targeted text as expected:
from lxml import html
htmlstr = """
<span>
<a href="www.imagine_a_link_here.org">
"
This is the text I need to access
"
</a>
</span>
"""
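# html.fromstring() picks document or fragment parsing based on the input
tree = html.fromstring(htmlstr)
print(html.tostring(tree))
# Use // so the match does not depend on span being the root element
the_text_i_need_to_access_xpath = '//span/a/text()'
the_text_i_need_to_access = tree.xpath(the_text_i_need_to_access_xpath)
print(the_text_i_need_to_access)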
Or, if you don't need or want the HTML parser's surprises and your markup is well-formed XML, this also selects the text:
import lxml.etree as ET
xmlstr = """
<span>
<a href="www.imagine_a_link_here.org">
"
This is the text I need to access
"
</a>
</span>
"""
# ET.fromstring() returns the root element itself (span), so an absolute path works
root = ET.fromstring(xmlstr)
print(root.xpath('/span/a/text()'))
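Since the selected text node keeps its surrounding whitespace, you may want to trim it. The following is a small sketch, assuming markup like the example above but without the literal quote delimiters; it compares the raw text node with the result of the XPath 1.0 `normalize-space()` function.

```python
import lxml.etree as ET

# Same shape as the example above, minus the " delimiters (an assumption).
xmlstr = """
<span>
<a href="www.imagine_a_link_here.org">
   This is the text I need to access
</a>
</span>
"""
root = ET.fromstring(xmlstr)

# text() returns a list of text nodes with whitespace intact
raw = root.xpath('/span/a/text()')
# normalize-space() returns a single string with leading/trailing
# whitespace removed and internal runs collapsed to one space
trimmed = root.xpath('normalize-space(/span/a)')

print(repr(raw[0]))
print(repr(trimmed))
```

Calling `raw[0].strip()` in Python gives the same result here; `normalize-space()` is handy when you want the trimming done inside the XPath itself.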
Credit: Thanks to @ThomasWeller for pointing out the additional complications and helping to resolve them.