
Through other SO questions I've found how to get the headlines, but I don't know where the Google News markup stores the links.

I want a two-column data.frame of the headlines and their corresponding links.

library(rvest)
library(tidyverse)


dat <- read_html("https://news.google.com/search?q=coronavirus&hl=en-US&gl=US&ceid=US%3Aen") %>%
  html_nodes('.DY5T1d') %>% # headline nodes
  html_text()

dat
SCDCE
  • Google is a bit difficult to scrape. :) All links should be saved in "href". If you run into difficulty, maybe you should use RSelenium; that way you will be able to navigate the website. – Earl Mascetti Mar 05 '20 at 15:22
  • I found the description reference in the source code but still no idea what the links are stored under – SCDCE Mar 05 '20 at 15:36
  • Did you try to follow this https://stackoverflow.com/questions/35247033/using-rvest-to-extract-links ? – Earl Mascetti Mar 05 '20 at 15:41

1 Answer


After a lot of digging through the Google News page source I found what I was looking for. I also came across the descriptions, so I basically rebuilt the Google News RSS feed.

library(rvest)
library(tidyverse)


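news <- function(term) {

  # Read the results page once; URL-encode the term so spaces and symbols are safe
  html_dat <- read_html(paste0("https://news.google.com/search?q=",
                               URLencode(term, reserved = TRUE),
                               "&hl=en-US&gl=US&ceid=US%3Aen"))

  # Links are relative ("./articles/..."), so rewrite them as absolute URLs.
  # The pattern is anchored and the dot escaped so it only matches a literal "./articles/".
  links <- html_dat %>%
    html_nodes('.VDXfz') %>%
    html_attr('href')
  links <- sub("^\\./articles/", "https://news.google.com/articles/", links)

  # Headline text
  titles <- html_dat %>%
    html_nodes('.DY5T1d') %>%
    html_text()

  # data.frame() errors if the two selectors return different numbers of nodes
  data.frame(Title = titles, Link = links, stringsAsFactors = FALSE)
}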

news("coronavirus")
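
To check the link rewriting without hitting Google, here is a minimal sketch that runs the same selectors on a made-up HTML fragment. The markup is an assumption: only the class names come from the code above, and the article IDs and headlines are invented for illustration.

```r
library(rvest)

# Hypothetical markup: class names taken from the answer, structure and IDs assumed
page <- read_html('
<html><body>
  <article>
    <a class="VDXfz" href="./articles/abc123"></a>
    <h3><a class="DY5T1d" href="./articles/abc123">First headline</a></h3>
  </article>
  <article>
    <a class="VDXfz" href="./articles/def456"></a>
    <h3><a class="DY5T1d" href="./articles/def456">Second headline</a></h3>
  </article>
</body></html>')

# Relative hrefs, rewritten to absolute URLs
links <- page %>% html_nodes(".VDXfz") %>% html_attr("href")
links <- sub("^\\./articles/", "https://news.google.com/articles/", links)

# Headline text
titles <- page %>% html_nodes(".DY5T1d") %>% html_text()

# Both selectors must match the same number of nodes before pairing them
stopifnot(length(titles) == length(links))
result <- data.frame(Title = titles, Link = links, stringsAsFactors = FALSE)
result
```

This only shows that the selectors and the `sub()` call pair up as intended on the assumed structure; the live page may be laid out differently.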
SCDCE
  • Scraping tip: your code above calls `read_html(url)` twice inside the function. Read the webpage once with `page <- read_html(url)` and then parse the data from the `page` variable. This will improve the script's performance and reduce the number of page hits to the server. Please read the website's terms of service prior to use. FYI: scraping generally violates the terms. – Dave2e Mar 05 '20 at 17:49
  • The only downside with this is that the classes could change at any time and crash your program :/ – Justin Dalrymple May 21 '20 at 17:27
  • Looks like Google recently dropped the article descriptions. I imagine there are snippets somewhere but who knows where those are... – SCDCE Nov 02 '21 at 17:53
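
On the concern that the class names could change at any time, one option is to fail with a clear message as soon as a selector stops matching. This is a minimal sketch: `select_or_stop()` is a hypothetical helper, not part of rvest, and the HTML fragment is made up for illustration.

```r
library(rvest)

# Hypothetical helper: return the matched nodes, or stop with a readable error
select_or_stop <- function(page, css) {
  nodes <- html_nodes(page, css)
  if (length(nodes) == 0) {
    stop("No nodes matched '", css, "'; the page's class names may have changed.")
  }
  nodes
}

# Made-up fragment containing only the headline class
page <- read_html('<html><body><a class="DY5T1d" href="./articles/abc123">Headline</a></body></html>')

# Selector that matches: returns the headline text
titles <- html_text(select_or_stop(page, ".DY5T1d"))

# Selector that no longer matches: capture the error message instead of crashing
missing <- tryCatch(select_or_stop(page, ".VDXfz"),
                    error = function(e) conditionMessage(e))
```

An explicit error that names the selector is easier to diagnose than a mismatched-length failure from `data.frame()` further down.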