While web-scraping research articles with R, I encountered HTML in which a <div></div> tag is nested within a <p></p> tag, which is apparently invalid HTML.
I nevertheless wish to retrieve the entire text within the <p></p> tag.
If I simply do the following, any text that comes after the closing </div> tag is ignored, because a closing </p> and a line break are apparently inserted automatically before the <div> tag.
In the example below, what I want to retrieve is "text1text3" rather than just "text1".
> library("rvest"); library("tidyverse")
> x <- read_html("<p>text1<div>text2</div>text3</p>")
> x %>% html_nodes("p") %>% html_text()
[1] "text1"
> x
{xml_document}
<html>
[1] <body>\n<p>text1</p>\n<div>text2</div>text3</body>
Is there a way to do this? Any pointer would be appreciated.
Clarification:
What I want to do is retrieve the text of <p> nodes, wherever they are placed. They are often nested within <div></div>, or may contain <div></div> as in the example above. I would prefer to exclude the text of <div> nodes nested within <p> nodes, but either is fine. So I wish to extract "text2text4" (or "text2text3text4", with a preference for the former) from the following: <div>text1<p>text2<div>text3</div>text4</p>text5</div>
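To make the desired result concrete, here is a minimal sketch that produces it for the two toy strings above. It rests on a strong assumption: the markup must be well-formed XML, so that it can be parsed with xml2::read_xml(), which keeps the nesting as written instead of applying HTML's rules for <p>. Real pages often are not well-formed, so this may not carry over to actual scraping. The helper name p_own_text is hypothetical and only for illustration.

```r
library(xml2)

# Hypothetical helper: for every <p> in a well-formed XML string, concatenate
# only the text nodes that are direct children of that <p>, so text inside a
# nested <div> is excluded.
p_own_text <- function(markup) {
  doc <- read_xml(markup)               # strict XML parse; nesting is kept as written
  p_nodes <- xml_find_all(doc, "//p")   # all <p> elements, including the root
  vapply(
    p_nodes,
    function(p) paste(xml_text(xml_find_all(p, "./text()")), collapse = ""),
    character(1)
  )
}

p_own_text("<p>text1<div>text2</div>text3</p>")
# [1] "text1text3"
p_own_text("<div>text1<p>text2<div>text3</div>text4</p>text5</div>")
# [1] "text2text4"
```

Replacing "./text()" with ".//text()" in the XPath would include the nested <div> text as well, giving "text2text3text4" for the second string.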