
While web-scraping research articles with R, I encountered HTML in which a <div></div> tag is nested within a <p></p> tag, which is apparently invalid HTML.

I nevertheless wish to retrieve the entire text within the <p></p> tag.

If I simply do the following, any text that comes after the closing </div> tag is ignored, because apparently a closing </p> and a line break are automatically inserted before the <div> tag.

In the example below, what I want to retrieve is "text1text3" rather than just "text1".

> library("rvest"); library("tidyverse")
> x <- read_html("<p>text1<div>text2</div>text3</p>")
> x %>% html_nodes("p") %>% html_text()
[1] "text1"
> x
{xml_document}
<html>
[1] <body>\n<p>text1</p>\n<div>text2</div>text3</body>

Is there a way to do this? Any pointer would be appreciated.

Clarification:

What I want to do is retrieve the text of <p>-nodes, wherever they are placed. They are often nested within <div></div>, or may contain <div></div> as in the example above. I prefer to exclude the text of <div>-nodes nested within <p>-nodes, but either is fine. So I wish to extract "text2text4" (or "text2text3text4", with my preference for the former) from the following: <div>text1<p>text2<div>text3</div>text4</p>text5</div>.

Akira Murakami
  • Can you replace the leading and trailing Ps with DIV before parsing? – mplungjan Jul 17 '19 at 14:34
  • I think that's the last resort. The files I'm working on include many other `<div>` tags that are not within `<p>` tags, and so if I replace `<p>` with `<div>`, it would retrieve the text that I do not wish to retrieve as well. I would nevertheless do it and delete unnecessary text if this is the only option.
    – Akira Murakami Jul 17 '19 at 15:55
  • Can you do `html_text(x)` and then `stringr::str_remove(..., html_nodes(x, "div") %>% html_text())` – Brian Jul 18 '19 at 00:15
  • It works perfectly in the first example, but unfortunately returns "text1text2text4text5" in the second example (under 'Clarification'). Sorry for the confusion over the specification of the task I'm tackling. – Akira Murakami Jul 21 '19 at 09:00
  • In XPath, `string(/div/p)` will result in `text2text3text4`. – Alejandro Jul 25 '19 at 22:49
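
The `string()` suggestion in the last comment is XPath 1.0, so it can be tried from R. Below is a minimal sketch, assuming the fragment is well-formed XML so that `read_xml()` can stand in for `read_html()` and keep the `<div>` inside the `<p>`; real pages may not satisfy that assumption. The variable names are illustrative only.

```r
library(xml2)

# Assumption: the input is well-formed XML, so the XML parser can be used.
# Unlike read_html(), it does not close <p> when it meets a <div>.
doc <- read_xml("<div>text1<p>text2<div>text3</div>text4</p>text5</div>")

# string() is an XPath 1.0 function; xml_find_chr() returns its value
# as a character vector of length one.
p_text <- xml_find_chr(doc, "string(/div/p)")
print(p_text)
```

Note that `string()` applied to a node-set uses only the first matching node, so this form covers a single `<p>` at a time.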

1 Answer


Here is a pure XPath solution that gives text2text3text4 as output.

string-join(//p/descendant-or-self::*/text(),'')

supputuri
  • I'm sure this XPath expression works in the way I want it to. Unfortunately, however, the `rvest` package in R (or the `xml2` package underlying it) does not appear to support XPath 2.0, which I believe is necessary to use the `string-join` function. `xml_find_all(x, "string-join(//*[@class='content']/blockquote/text()[normalize-space()], ' ')")` leads to an error: `xmlXPathCompOpEval: function string-join not found`. – Akira Murakami Jul 21 '19 at 09:04
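
Since `string-join` is not available in XPath 1.0, one possible workaround is to select the nodes with XPath and do the joining in R. This is only a sketch under the same assumption that the input is well-formed XML and can be parsed with `read_xml()`; `all_text` and `own_text` are illustrative names, not part of any package.

```r
library(xml2)

# Assumption: well-formed input, parsed as XML so the nesting is preserved.
doc <- read_xml("<div>text1<p>text2<div>text3</div>text4</p>text5</div>")

# Every <p> node, wherever it sits in the tree.
p_nodes <- xml_find_all(doc, "//p")

# Variant 1: all text inside each <p>, including nested <div> text.
all_text <- xml_text(p_nodes)

# Variant 2: only the text nodes that are direct children of each <p>,
# joined in R instead of with string-join().
own_text <- vapply(
  p_nodes,
  function(p) paste(xml_text(xml_find_all(p, "./text()")), collapse = ""),
  character(1)
)

print(all_text)
print(own_text)
```

For the second example in the question, the first variant should give "text2text3text4" and the second "text2text4".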