Retrieve text in the `<p>` tag that contains `<div>`

Question

While web-scraping research articles with R, I encountered HTML in which a <div></div> tag is nested within a <p></p> tag, which is apparently invalid HTML.

I nevertheless wish to retrieve the entire text within the <p></p> tag.

If I simply do the following, any text that comes after the closing </div> tag is ignored, apparently because the parser automatically inserts a closing </p> and a line break before the <div> tag.

In the example below, what I want to retrieve is "text1text3" rather than just "text1".

> library("rvest"); library("tidyverse")
> x <- read_html("<p>text1<div>text2</div>text3</p>")
> x %>% html_nodes("p") %>% html_text()
[1] "text1"
> x
{xml_document}
<html>
[1] <body>\n<p>text1</p>\n<div>text2</div>text3</body>

Is there a way to do this? Any pointer would be appreciated.

Clarification:

What I want to do is retrieve the text of <p>-nodes, wherever they are placed. They are often nested within <div></div>, or may contain <div></div> as in the example above. I would prefer to exclude the text of <div>-nodes nested within <p>-nodes, but either is fine. So I wish to extract "text2text4" (or "text2text3text4", with my preference for the former) from the following: <div>text1<p>text2<div>text3</div>text4</p>text5</div>.

Can you replace the leading and trailing Ps with DIV before parsing? — mplungjan, Jul 17 '19 at 14:34
I think that's the last resort. The files I'm working on include many other `<div>` tags that are not within `<p>` tags, and so if I replace `<p>` with `<div>`, it would retrieve the text that I do not wish to retrieve as well. I would nevertheless do it and delete unnecessary text if this is the only option. — Akira Murakami, Jul 17 '19 at 15:55
Can you do `html_text(x)` and then `stringr::str_remove(..., html_nodes(x, "div") %>% html_text())`? — Brian, Jul 18 '19 at 00:15
It works perfectly in the first example, but unfortunately returns "text1text2text4text5" in the second example (under 'Clarification'). Sorry for the confusion over the specification of the task I'm tackling. — Akira Murakami, Jul 21 '19 at 09:00
In XPath, `string(/div/p)` will result in `text2text3text4`. — Alejandro, Jul 25 '19 at 22:49
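
A minimal sketch of how that XPath 1.0 `string()` call could be run from R. It assumes the fragment is well-formed enough to be parsed with `xml2::read_xml()` instead of `read_html()`; real-world pages often are not, so this may not carry over to the actual files. Parsing as XML keeps the <div> inside the <p>, and `xml_find_chr()` evaluates an XPath expression that returns a string:

```r
library(xml2)

# Assumption: the markup is well-formed XML, so read_xml() can parse it.
# Unlike the HTML parser, the XML parser does not close <p> before <div>.
doc <- read_xml("<div>text1<p>text2<div>text3</div>text4</p>text5</div>")

# string() returns the concatenated text of the first matching node,
# including the text of all of its descendants.
p_text <- xml_find_chr(doc, "string(/div/p)")
print(p_text)
```

Note that `string()` only looks at the first matching node, so a document with several <p> nodes would need a loop over the result of `xml_find_all()`.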

score 0 · Answer 1 · answered Jul 20 '19 at 04:07

Here is a pure XPath solution that returns text2text3text4:

string-join(//p/descendant-or-self::*/text(),'')

answered Jul 20 '19 at 04:07 by supputuri

I'm sure this XPath expression works in the way I want it to. Unfortunately, however, the `rvest` package in R (or the `xml2` package underlying it) does not appear to support XPath 2.0, which I believe is necessary to use the `string-join` function. `xml_find_all(x, "string-join(//*[@class='content']/blockquote/text()[normalize-space()], ' ')")` leads to an error: `xmlXPathCompOpEval: function string-join not found`. – Akira Murakami Jul 21 '19 at 09:04
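
Since `string-join` is an XPath 2.0 function, one possible workaround is to select the text nodes with XPath 1.0 and do the joining in R instead. The following is only a sketch under two assumptions: the markup is well-formed enough for `xml2::read_xml()` (which, unlike `read_html()`, does not close the <p> before the <div>), and `join_text` is a hypothetical helper name introduced here for illustration.

```r
library(xml2)

# Assumption: the fragment parses as XML, so the <div> stays nested in the <p>.
doc <- read_xml("<div>text1<p>text2<div>text3</div>text4</p>text5</div>")

p_nodes <- xml_find_all(doc, "//p")

# Hypothetical helper: find the text nodes matched by `xpath` relative to
# one node and concatenate them, emulating string-join(..., '').
join_text <- function(node, xpath) {
  paste(xml_text(xml_find_all(node, xpath)), collapse = "")
}

# All text inside each <p>, including text of nested elements.
all_text <- vapply(p_nodes, join_text, character(1), xpath = ".//text()")

# Only the text that is a direct child of each <p>, skipping nested <div>s.
own_text <- vapply(p_nodes, join_text, character(1), xpath = "./text()")

print(all_text)
print(own_text)
```

Here `own_text` corresponds to the preferred "text2text4" result from the clarification, and `all_text` to the "text2text3text4" variant.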
