
I'm trying to solve this puzzle in RStudio and it's been a struggle. I'm using code for this experiment that was originally provided by user @SCDCE a couple of years ago (see: How to scrape Google News results into a data.frame with rvest), but apparently I lack the Stack Juju (reputation) to ask a follow-up question on that thread.

I'm trying to figure out how to add a third column, titled "Text" or "Content", to the two generated in the code below, and have it automatically pull each article's content into the row that corresponds to its title and link. The goal is then to use this df$Text content for sentiment analysis, etc. (I sketch what I'm picturing right after the code.)

Here is SCDCE's code, with my only addition being the save to the "c19_news" df at the end:

library(rvest)
library(tidyverse)


news <- function(term) {
  
  # Fetch the Google News results page for the search term
  html_dat <- read_html(paste0("https://news.google.com/search?q=",term,"&hl=en-US&gl=US&ceid=US%3Aen"))

  # Collect each result's relative link and expand it to a full URL
  dat <- data.frame(Link = html_dat %>%
                      html_nodes('.VDXfz') %>% 
                      html_attr('href')) %>% 
    mutate(Link = gsub("./articles/","https://news.google.com/articles/",Link))
  
  # Pair each headline with its link
  news_dat <- data.frame(
    Title = html_dat %>%
      html_nodes('.DY5T1d') %>% 
      html_text(),
    Link = dat$Link
  )
  
  return(news_dat)
}

c19_news <- news("coronavirus")
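
To be concrete about the goal: starting from that result, what I'm picturing is something like the sketch below, where each Link gets visited and its paragraph text collected into one string per row. (Just a sketch, not working code: the "p" selector is my SelectorGadget guess and varies by site, and I'm not sure the Google News redirect links even resolve this way.)

# Sketch of the goal: visit each article link and collect its paragraph text.
# The "p" selector is a guess; the real selector differs across news sites.
get_text <- function(url) {
  page <- tryCatch(read_html(url), error = function(e) NULL)
  if (is.null(page)) return(NA_character_)  # leave NA for pages that fail to load
  page %>%
    html_nodes("p") %>%
    html_text() %>%
    paste(collapse = " ")
}

c19_news$Text <- sapply(c19_news$Link, get_text)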

...And here is one of the many variations on THIS that I tried to accomplish the goal, without success:

library(rvest)
library(tidyverse)


news <- function(term) {
  
  html_dat <- read_html(paste0("https://news.google.com/search?q=",term,"&hl=en-US&gl=US&ceid=US%3Aen"))

  dat <- data.frame(Link = html_dat %>%
                      html_nodes('.VDXfz') %>% 
                      html_attr('href')) %>% 
    mutate(Link = gsub("./articles/","https://news.google.com/articles/",Link))
  
  news_dat <- data.frame(
    Title = html_dat %>%
      html_nodes('.DY5T1d') %>% 
      html_text(),
    Link = dat$Link,
    # Added this section (this is the part that errors):
    Text = html_dat %>% 
      read_html() %>% 
      html_nodes('li , .p') %>%
      html_text()

  )
  
  return(news_dat)
}

c19_news <- news("coronavirus")

...And here is the error it throws, with the traceback:

Error in UseMethod("read_xml") :
  no applicable method for 'read_xml' applied to an object of class "c('xml_document', 'xml_node')"
12. read_xml(x, encoding = encoding, ..., as_html = TRUE, options = options)
11. withCallingHandlers(expr, warning = function(w) if (inherits(w, classes)) tryInvokeRestart("muffleWarning"))
10. suppressWarnings(read_xml(x, encoding = encoding, ..., as_html = TRUE, options = options))
9. read_html.default(.)
8. read_html(.)
7. html_elements(...)
6. html_nodes(., "li , .p")
5. xml_text(x, trim = trim)
4. html_text(.)
3. html_dat %>% read_html() %>% html_nodes("li , .p") %>% html_text()
2. data.frame(Title = html_dat %>% html_nodes(".DY5T1d") %>% html_text(), Link = dat$Link, Text = html_dat %>% read_html() %>% html_nodes("li , .p") %>% html_text())
1. news("coronavirus")

Can anyone help with this? Thank you!

Bodhi
  • Where is this `li, .p` located, and what does it contain on the Google News website? – Chamkrai Jul 28 '22 at 15:25
  • Hi Tom. Ty for responding. The li, .p was one of my attempts to set the selectors for the corresponding news pages. I used SelectorGadget to grab the "p" tag (I tried just this tag alone, with and without the period), and some pages had bulleted lists in the text that I wanted too, which the gadget flagged as "li"... On some other news queries, this paragraph tag is labelled very differently (for instance: ".responsiveNews"), but I just want this to iterate through the links and grab all the paragraph text on the pages... – Bodhi Jul 28 '22 at 16:10

0 Answers