I'm struggling with this puzzle in RStudio. I'm using code originally posted by user @SCDCE a couple of years ago (see: How to scrape Google News results into a data.frame with rvest), but I don't have enough reputation to post a follow-up comment or question on that thread.
I'm trying to add a third column, titled "Text" or "Content", to the two columns generated by the code below, and to have it automatically filled with the article content that corresponds to each row's title and link. The goal is to then use df$Text for sentiment analysis, etc.
Here is the code from SCDCE. My only change is saving the result to the "c19_news" data frame at the end:
library(rvest)
library(tidyverse)

news <- function(term) {
  html_dat <- read_html(paste0("https://news.google.com/search?q=",term,"&hl=en-US&gl=US&ceid=US%3Aen"))
  dat <- data.frame(Link = html_dat %>%
                      html_nodes('.VDXfz') %>%
                      html_attr('href')) %>%
    mutate(Link = gsub("./articles/","https://news.google.com/articles/",Link))
  news_dat <- data.frame(
    Title = html_dat %>%
      html_nodes('.DY5T1d') %>%
      html_text(),
    Link = dat$Link
  )
  return(news_dat)
}

c19_news <- news("coronavirus")
I tried many variations on the following to add the new column, without success:
library(rvest)
library(tidyverse)

news <- function(term) {
  html_dat <- read_html(paste0("https://news.google.com/search?q=",term,"&hl=en-US&gl=US&ceid=US%3Aen"))
  dat <- data.frame(Link = html_dat %>%
                      html_nodes('.VDXfz') %>%
                      html_attr('href')) %>%
    mutate(Link = gsub("./articles/","https://news.google.com/articles/",Link))
  news_dat <- data.frame(
    Title = html_dat %>%
      html_nodes('.DY5T1d') %>%
      html_text(),
    Link = dat$Link,
    # Added this section below:
    Text = html_dat %>%
      read_html() %>%
      html_nodes('li , .p') %>%
      html_text()
  )
  return(news_dat)
}

c19_news <- news("coronavirus")
And here is the error it throws, with the traceback:
Error in UseMethod("read_xml") :
no applicable method for 'read_xml' applied to an object of class "c('xml_document', 'xml_node')"
12. read_xml(x, encoding = encoding, ..., as_html = TRUE, options = options)
11. withCallingHandlers(expr, warning = function(w) if (inherits(w,
        classes)) tryInvokeRestart("muffleWarning"))
10. suppressWarnings(read_xml(x, encoding = encoding, ..., as_html = TRUE,
        options = options))
9. read_html.default(.)
8. read_html(.)
7. html_elements(...)
6. html_nodes(., "li , .p")
5. xml_text(x, trim = trim)
4. html_text(.)
3. html_dat %>% read_html() %>% html_nodes("li , .p") %>%
       html_text()
2. data.frame(Title = html_dat %>% html_nodes(".DY5T1d") %>% html_text(),
       Link = dat$Link, Text = html_dat %>% read_html() %>% html_nodes("li , .p") %>%
           html_text())
1. news("coronavirus")
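Reading the traceback from the bottom up, frames 8 and 9 are read_html() being called on html_dat, which read_html() had already parsed at the top of the function. The following is a minimal sketch that appears to reproduce the same failure without Google News. The inline HTML string is a made-up stand-in for the real page, not something from the original code:

```r
library(rvest)

# Hypothetical stand-in for the downloaded page: read_html() also accepts literal HTML
doc <- read_html("<html><body><p>placeholder text</p></body></html>")

# doc is already a parsed document, so selectors can be applied to it directly
direct <- doc %>% html_nodes("p") %>% html_text()

# Passing the parsed document back into read_html() mirrors frames 8 and 9;
# tryCatch() captures the resulting error instead of stopping the script
reparsed <- tryCatch(
  doc %>% read_html() %>% html_nodes("p") %>% html_text(),
  error = function(e) e
)

print(direct)
print(class(reparsed))
```

If that reading is right, dropping the extra read_html() would clear this particular error, though html_dat is still the search results page rather than the articles themselves.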
Can anyone help with this? Thank you!
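One possible direction, as a rough sketch only: the article text presumably lives on each linked page rather than on the search results page, so each Link would need its own read_html() call, collapsed to one string per row so the new column lines up with Title and Link. The helper name article_text, the "p" selector, and the inline HTML pages below are all assumptions for illustration, and it is not verified here that the Google News redirect links resolve to plain article HTML:

```r
library(rvest)

# Hypothetical helper: parse one page and collapse its <p> text into one string.
# The "p" selector is an assumption; real news sites will likely need their own selectors.
article_text <- function(source) {
  tryCatch(
    read_html(source) %>%
      html_nodes("p") %>%
      html_text(trim = TRUE) %>%
      paste(collapse = " "),
    error = function(e) NA_character_  # keep the row even if a page fails to load
  )
}

# Made-up inline pages standing in for the article URLs in news_dat$Link
pages <- c(
  "<html><body><p>First paragraph.</p><p>Second paragraph.</p></body></html>",
  "<html><body><p>Only paragraph.</p></body></html>"
)

demo <- data.frame(Title = c("Story A", "Story B"), stringsAsFactors = FALSE)

# One string per row, so the new column always matches the number of titles
demo$Text <- vapply(pages, article_text, character(1), USE.NAMES = FALSE)

print(demo)
```

With the real data, the same vapply() call would be pointed at c19_news$Link instead of pages. Returning NA for pages that fail keeps the column the same length as the other two, which data.frame() requires.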