How do you access text in an XPath if it doesn't have a node? The text is in quotation marks and on a separate line inside another node

I'm having trouble choosing the correct element in an XPath

 <span>
    <a href="www.imagine_a_link_here.org">
      "
                This is the text I need to access
             "
    </a>
 </span>

I'd normally do this by writing

import requests
from lxml import html,etree
from lxml.html import document_fromstring

page = requests.get('https://www.the_link_im_trying_to_webscrape.org')
tree = html.fromstring(page.content)
the_text_i_need_to_access_xpath = '/span/a/text()'
the_text_i_need_to_access = tree.xpath(the_text_i_need_to_access_xpath)

Unfortunately, this only returns an empty list. Does anyone know how I should modify the XPath to get the string I'm looking for?

kjhughes
Alfred
  • For the sample HTML you posted, that XPath is correct and should select the text inside the anchor. Are you sure that the HTML doesn't include other wrapping elements, such as `<html>` and `<body>`? You might try using a more generic XPath with the descendant axis, i.e. `//span/a/text()` – Mads Hansen Feb 10 '21 at 21:18
  • @MadsHansen: if you hard code the text, it will not give a result – Thomas Weller Feb 10 '21 at 21:20

1 Answer

How do you access a text in an XPath if it doesn't have a node?

Text in an XML or HTML document will be associated with a node. That's not the problem here. And the " " delimiters are just there to show you surrounding whitespace.
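
As a minimal sketch of that point, the snippet below uses the standard library's `xml.etree.ElementTree` as a stand-in parser (my choice for illustration, not what the question uses) and assumes the quotation marks are not literally present in the markup. It shows that the text belongs to the a element and that only the surrounding whitespace needs trimming; the names `snippet`, `raw_text`, and `clean_text` are hypothetical:

```python
import xml.etree.ElementTree as ET

# Assumption: the quotation marks from the question are omitted here, because
# they only visualize the whitespace around the text.
snippet = """
<span>
   <a href="www.imagine_a_link_here.org">
        This is the text I need to access
   </a>
</span>
"""

root = ET.fromstring(snippet)     # root is the <span> element
raw_text = root.find("a").text    # text content of <a>, whitespace included
clean_text = raw_text.strip()     # drop the leading/trailing whitespace

print(repr(raw_text))
print(repr(clean_text))
```

The same `.strip()` call can be applied to each string that `tree.xpath('.../text()')` returns in lxml.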

As presented, your XPath should select the text within the a element. Here are some reasons why that may not be happening:

  1. As @MadsHansen mentioned in comments, the root element of your actual HTML may not be a span as shown. See:

  2. The text may not be loaded at the time of your XPath execution because the document hasn't completely loaded or because JavaScript dynamically changes the DOM later. See:

  3. fromstring() can use a bit more magic than might be expected:

fromstring(string): Returns document_fromstring or fragment_fromstring, based on whether the string looks like a full document, or just a fragment.
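
To see what that "magic" can mean in practice, here is a small sketch, assuming lxml is installed. The names `fragment`, `span`, and `doc_root` are mine, not from the question. As far as I know, `lxml.html.fromstring()` runs even a fragment through the HTML parser and hands back the single element while it still sits inside a generated html/body tree, which would explain why an absolute path starting at `/span` finds nothing:

```python
from lxml import html

# Hypothetical fragment modeled on the markup in the question.
fragment = (
    '<span><a href="www.imagine_a_link_here.org">'
    ' This is the text I need to access '
    '</a></span>'
)

span = html.fromstring(fragment)          # returns the <span> element
doc_root = span.getroottree().getroot()   # root of the tree that span lives in

print(span.tag)      # tag of the returned element
print(doc_root.tag)  # tag of the actual document root

# Absolute paths are evaluated from the document root, not from span.
absolute = span.xpath('/span/a/text()')
# The descendant axis matches span wherever it sits in the tree.
descendant = span.xpath('//span/a/text()')
# Spelling out the generated wrappers also reaches the text.
full = span.xpath('/html/body/span/a/text()')

print(absolute)
print(descendant)
print(full)
```

If the first query prints an empty list while the other two print the text, the wrapping is the cause, and `//span/a/text()` (or a relative path such as `a/text()` evaluated on `span`) is the more robust choice.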

Given this, here is an update to your code that will select the targeted text as expected:

from lxml import html

htmlstr = """
<span>
   <a href="www.imagine_a_link_here.org">
     "
               This is the text I need to access
            "
   </a>
</span>
"""

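# fromstring() may wrap a fragment in html/body elements, so span is not
# guaranteed to be the document root of the parsed tree.
tree = html.fromstring(htmlstr)
print(html.tostring(tree))
# Use // so the match does not depend on span being the root element.
the_text_i_need_to_access_xpath = '//span/a/text()'
the_text_i_need_to_access = tree.xpath(the_text_i_need_to_access_xpath)
print(the_text_i_need_to_access)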

Or, if you don't need/want the HTML surprises, this also selects the text:

import lxml.etree as ET

xmlstr = """
 <span>
    <a href="www.imagine_a_link_here.org">
      "
                This is the text I need to access
             "
    </a>
 </span>
"""

root = ET.fromstring(xmlstr)
print(root.xpath('/span/a/text()'))

Credit: Thanks to @ThomasWeller for pointing out the additional complications and helping to resolve them.

kjhughes
  • If you hard code the text, it will not give a result, even if span is the root element. Neither / nor /span selects something. – Thomas Weller Feb 10 '21 at 21:30
  • @ThomasWeller: Hi Thomas, tell me what you mean by hard-coding the text -- I'm not following. It seems like you're saying that if the XPath references the targeted text improperly, it won't match. Agree there, but OP references it only as `text()`, so not sure I'm following your point. – kjhughes Feb 10 '21 at 21:35
  • Lol, yeah, that's a bit of a mess. It's going to be hard to reproduce without a server that returns OP's document. – kjhughes Feb 10 '21 at 21:38
  • https://chat.stackoverflow.com/rooms/228547/lxml-htmlelement – Thomas Weller Feb 10 '21 at 21:38
  • No need to get a string from a server. Just hard-code it – Thomas Weller Feb 10 '21 at 21:39
  • Thanks for the help, kjhughes. I looked into your suggestions. I've already iterated between "/" and "//" in every place I could think of, and it didn't change the result. I also added a rudimentary timer after the page definition but before the tree definition, but it also didn't help :/ – Alfred Feb 10 '21 at 22:03
  • @Alfred: Ah, I see now what's confusing matters. See updated answer. – kjhughes Feb 10 '21 at 22:28
  • Thank you so much kjhughes and Thomas Weller! Looks very promising)) I will now sleep for 12 hrs and hopefully, when i wake up, ill be able to implement all of your suggestions! Cheers guys)) – Alfred Feb 10 '21 at 22:48