
I want to extract the text contained in the red and green rectangles shown in the screenshot below. Note that the text is not enclosed in an opening and closing tag of its own.

http://temperate.theferns.info/plant/Acacia+omalophylla

[Screenshot: the page with the target text outlined in red and green rectangles]

For example, for the text in the green rectangle, I tried the following XPath query with this Python/Selenium code:

greenrec_xpath = "//*[preceding::h3[contains(text(), 'General Information')] and following::h3[contains(text(), 'Known Hazards')]]"
driver.find_elements_by_xpath(greenrec_xpath)

But I did not get the expected results.

Any ideas?

undetected Selenium
Bertrand

3 Answers

1

When text has no tag of its own immediately around it, it is known as a text node. It is a bit trickier to find because it cannot be accessed directly the way you are attempting. What I usually do is locate the immediate parent element and get the text from that. This gets harder if there are multiple text nodes under that parent, and it usually requires some parsing or splitting after you have the entire text.

Alternatively, if you are in a situation where you can guarantee that your text node contains some specific text, you can swap text() with . and make the xpath that way. For example: //*[contains(.,'Acacia omalophylla')]
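The parsing/splitting step mentioned above can be sketched in plain Python. This is only a minimal illustration: `text_between` is a hypothetical helper, and `sample` is a made-up stand-in for the string you would get from the parent element (for instance via its `.text` attribute in Selenium), not the real page content.

```python
def text_between(full_text, start_marker, end_marker):
    """Return the text found between two marker strings, or None if either is missing."""
    start = full_text.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)
    end = full_text.find(end_marker, start)
    if end == -1:
        return None
    return full_text[start:end].strip()


# Made-up stand-in for the text of the parent element.
sample = "General Information\nSome descriptive text.\nKnown Hazards\nNone known"
print(text_between(sample, "General Information", "Known Hazards"))
```

This only works when the heading strings appear exactly once, in order, in the parent's text; otherwise the slice boundaries need more care.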

1
greenrec_xpath = 
 "//*[preceding::h3[contains(text(), 'General Information')] 
    and following::h3[contains(text(), 'Known Hazards')]]"

You are quite close to finding an XPath expression that selects the wanted text nodes:

Use:

//*[preceding::h3[1][contains(., 'General Information')] 
  and following::h3[1][contains(., 'Known Hazards')]
   ]/text()[normalize-space()]

Be aware that this expression selects many text nodes (in this particular case 5).

If you want a single string, you need to get the string value of each selected text node and concatenate them. If you can only use XPath 1.0, you will need to do this concatenation in the calling (non-XPath) code.
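With Selenium, one way to do that concatenation in the calling code is sketched below. As far as I know, Selenium's element locators must resolve to elements, so an XPath ending in `/text()` cannot be passed to them directly; this sketch instead evaluates the expression in the browser with `document.evaluate` and joins the results in Python. The names `JS_TEXT_NODES` and `joined_text` are hypothetical, and `driver` is assumed to be an already-created WebDriver.

```python
# JavaScript run in the page: evaluate the XPath and return the text of each matched node.
JS_TEXT_NODES = """
const result = document.evaluate(
    arguments[0], document, null,
    XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
const texts = [];
for (let i = 0; i < result.snapshotLength; i++) {
    texts.push(result.snapshotItem(i).textContent);
}
return texts;
"""


def joined_text(driver, xpath, separator=" "):
    """Evaluate `xpath` in the browser and join the non-empty node texts into one string."""
    parts = driver.execute_script(JS_TEXT_NODES, xpath)
    return separator.join(p.strip() for p in parts if p.strip())
```

You would call it with the expression shown above, for example `joined_text(driver, "//*[...]/text()[normalize-space()]")`, with the predicate filled in.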

If you can use XPath 2.0 (or later version) use:

string-join(
            //*[preceding::h3[1][contains(., 'General Information')] 
              and following::h3[1][contains(., 'Known Hazards')]
               ]/text()[normalize-space()]/string(.)
            ,
             ''
           )
Dimitre Novatchev
1

To extract the text Classification of the genus Acacia..., which is a text node, you need to wait for its parent element using WebDriverWait with visibility_of_element_located() and then read the text node through execute_script(). You can use the following locator strategy:

  • Code Block:

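    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    # Assumes `driver` is an already-created WebDriver instance.
    driver.get("http://temperate.theferns.info/plant/Acacia+omalophylla")
    # Wait for the container, then read the text of its child node at index 11.
    page_box = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div.PageBox")))
    print(driver.execute_script('return arguments[0].childNodes[11].textContent;', page_box).strip())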
    
  • Console Output :

    Classification of the genus Acacia (in the wider sense) has been subject to considerable debate. It is generally agreed that there are valid reasons for breaking it up into several distinct genera, but there has been disagreement over the way this should be done. As of 2017, it is widely (but not completely) accepted that the section that includes the majority of the Australian species (including this one) should retain the name Acacia, whilst other sections of the genus should be transferred to the genera Acaciella, Mariosousa, Senegalia and Vachellia[
    
undetected Selenium