How to extract text content between two nodes

Question

I want to extract the text contained in the red and green rectangles shown in the screenshot below. N.B.: the text is not enclosed in an opening and closing tag of its own.

http://temperate.theferns.info/plant/Acacia+omalophylla

For example, for the text in the green rectangle, I tested the following XPath query and code (Python/Selenium):

greenrec_xpath = "//*[preceding::h3[contains(text(), 'General Information')] and following::h3[contains(text(), 'Known Hazards')]]"
driver.find_elements_by_xpath(greenrec_xpath)

but I did not get the expected results.

Any ideas?

You can get the first part by doing By.Xpath("//div[@class='family']/following-sibling::br[1]") and then .Text(). Second lot is trickier as there are no tags. — ratsstack, Nov 20 '19 at 23:45
Bertrand, Did you try my solution? Does it solve the problem? — Dimitre Novatchev, Nov 21 '19 at 22:10
Thanks! Dimitre Novatchev, tomorrow I will post the solution! — Bertrand, Nov 22 '19 at 00:30

score 1 · Answer 1 · answered Nov 20 '19 at 17:40

When text is not wrapped in a tag of its own, it is a bare text node, and it is trickier to find because it cannot be selected directly the way you are attempting: Selenium locators return elements, not text nodes. What I usually do is locate the immediate parent and get the text from that. This gets trickier if there are multiple text nodes under that parent, and it usually requires some parsing or splitting after you get the entire text.

Alternatively, if you can guarantee that your text node contains some specific text, you can replace text() with . (the element's full string value) and build the XPath that way. For example: //*[contains(.,'Acacia omalophylla')]
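The "get the parent's text, then split it" idea can be sketched in plain Python. This is only an illustration under assumptions: the markup below is a hypothetical, simplified stand-in for the real page, the helper name text_between is made up, and the standard library's xml.etree.ElementTree takes the place of Selenium (where the parent's text would come from the element's text attribute instead).

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: the bare text sits directly under the
# parent div, between two <h3> headings, with no tag of its own.
SAMPLE = """
<div class="PageBox">
  <h3>General Information</h3>
  Acacia omalophylla is a shrub or small tree.
  <br/>
  It is harvested from the wild.
  <h3>Known Hazards</h3>
  None known.
</div>
"""

def text_between(parent, start_label, end_label):
    """Return the parent's text found between two heading labels."""
    # itertext() yields every text fragment under the parent, in document order.
    full_text = "".join(parent.itertext())
    # Keep what follows the first label, then what precedes the second one.
    _, _, after_start = full_text.partition(start_label)
    between, _, _ = after_start.partition(end_label)
    # Collapse the runs of whitespace left over from the markup layout.
    return " ".join(between.split())

parent = ET.fromstring(SAMPLE)
result = text_between(parent, "General Information", "Known Hazards")
print(result)
```

Splitting on the heading labels only works if those labels do not also appear inside the body text, so treat it as a starting point rather than a robust parser.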

score 1 · Answer 2 · answered Nov 21 '19 at 03:20

greenrec_xpath = 
 "//*[preceding::h3[contains(text(), 'General Information')] 
    and following::h3[contains(text(), 'Known Hazards')]]"

You are quite close to finding an XPath expression that selects the wanted text nodes:

Use:

//*[preceding::h3[1][contains(., 'General Information')] 
  and following::h3[1][contains(., 'Known Hazards')]
   ]/text()[normalize-space()]

Be aware that this expression selects many text nodes (in this particular case 5).

If you want a single string, you need to take the string value of each selected text node and concatenate them. If you can only use XPath 1.0, you will need to do this concatenation in the calling (non-XPath) program code.
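As a rough sketch of that concatenation step in calling code, assuming the string values of the selected text nodes have already been collected into a Python list (the list contents and the helper name join_text_nodes are hypothetical):

```python
def join_text_nodes(node_values, separator=" "):
    """Concatenate text-node string values, skipping whitespace-only ones."""
    cleaned = []
    for value in node_values:
        # Similar in spirit to normalize-space(): trim and collapse whitespace.
        normalized = " ".join(value.split())
        if normalized:
            cleaned.append(normalized)
    return separator.join(cleaned)

# Hypothetical string values standing in for the selected text nodes.
nodes = ["\n  Acacia omalophylla ", "\n", "is a shrub\n  or small tree.", "   "]
joined = join_text_nodes(nodes)
print(joined)
```

The sketch joins with a single space by default so that adjacent fragments do not run together; pass separator="" to join with an empty separator instead.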

If you can use XPath 2.0 (or a later version), use:

string-join(
            //*[preceding::h3[1][contains(., 'General Information')] 
              and following::h3[1][contains(., 'Known Hazards')]
               ]/text()[normalize-space()]/string(.)
            ,
             ''
           )

score 1 · Answer 3 · answered Nov 21 '19 at 09:41

To extract the text Classification of the genus Acacia..., which is a text node, wait for its parent element with WebDriverWait and visibility_of_element_located(), then read the text node through JavaScript. You can use the following locator strategy:

Code Block:

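from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Assumes `driver` is an already created WebDriver instance.
driver.get("http://temperate.theferns.info/plant/Acacia+omalophylla")
print(driver.execute_script('return arguments[0].childNodes[11].textContent;', WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div.PageBox")))).strip())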

Console Output :

Classification of the genus Acacia (in the wider sense) has been subject to considerable debate. It is generally agreed that there are valid reasons for breaking it up into several distinct genera, but there has been disagreement over the way this should be done. As of 2017, it is widely (but not completely) accepted that the section that includes the majority of the Australian species (including this one) should retain the name Acacia, whilst other sections of the genus should be transferred to the genera Acaciella, Mariosousa, Senegalia and Vachellia[
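The index 11 in childNodes[11] is tied to the layout of that particular page. As a hedged illustration of how such an index could be found, the sketch below lists the text nodes sitting directly under a parent along with their positions. It uses the standard library's xml.dom.minidom on hypothetical, simplified markup rather than the live page, and the helper name list_text_children is made up.

```python
from xml.dom import minidom

# Hypothetical, simplified stand-in for the page's div.PageBox element.
SAMPLE = (
    '<div class="PageBox">'
    "<h3>General Information</h3>"
    "Acacia omalophylla is a shrub."
    "<br/>"
    "Classification of the genus Acacia has been debated."
    "<h3>Known Hazards</h3>"
    "</div>"
)

def list_text_children(parent):
    """Return (index, text) pairs for non-empty text nodes directly under parent."""
    found = []
    for index, child in enumerate(parent.childNodes):
        # childNodes includes text nodes as well as elements, as in the browser DOM.
        if child.nodeType == child.TEXT_NODE and child.data.strip():
            found.append((index, child.data.strip()))
    return found

parent = minidom.parseString(SAMPLE).documentElement
text_children = list_text_children(parent)
for index, text in text_children:
    print(index, text)
```

On the real page the indices will differ, and they can change whenever the site's markup changes, so a hard-coded index is fragile.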
