I'm trying to scrape and download CSV files from a web page that lists a large number of them.

Code:

# Libraries
library(rvest)
library(httr)

# URL
url <- "http://data.gdeltproject.org/events/index.html"

# The CSVs I want are list items 14 through 378 (the year 2018)
gdelt_nodes <- seq(from = 14, to = 378, by = 1)

# HTML read / rvest action
link <- url %>% 
  read_html() %>% 
  html_nodes(paste0("body > ul > li:nth-child(", gdelt_nodes, ") > a")) %>% 
  html_attr("href")

I get this error:

 Error in xpath_search(x$node, x$doc, xpath = xpath, nsMap = ns, num_results = Inf) : 
   Expecting a single string value: [type=character; extent=365].

How do I correctly select nodes 14 through 378?

Once that is assigned, I plan to run a quick for loop to download all of the 2018 CSVs.

Dave2e
papelr

1 Answer

See the comments in the code for the step-by-step solution.
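
As background on the error itself: `paste0()` is vectorised, so pasting a vector of 365 indices produces 365 selector strings, while `html_nodes()` expects a single string. The sketch below only illustrates that mismatch and one possible way around it (collapsing the selectors into one comma-separated CSS group). It needs no network access; the names `selectors` and `selector` are made up for illustration.

```r
# Indices from the question (list items 14 through 378)
gdelt_nodes <- seq(from = 14, to = 378, by = 1)

# paste0() is vectorised: one selector string per index
selectors <- paste0("body > ul > li:nth-child(", gdelt_nodes, ") > a")
length(selectors)  # 365, matching "extent=365" in the error message

# Collapsing with commas yields a single string (a CSS selector group)
selector <- paste(selectors, collapse = ", ")
length(selector)   # 1
```

A single string like `selector` should be acceptable to `html_nodes()`, though selecting every link and filtering by file name, as in the code that follows, is less brittle than relying on list positions.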

library(rvest)

# URL
url <- "http://data.gdeltproject.org/events/index.html"

# Read the page in once then attempt to process it.
page <- url %>% read_html() 

# Extract the file list
filelist <- page %>% html_nodes("ul li a") %>% html_attr("href")

# Filter for files from 2018
filelist <- filelist[grep("2018", filelist)]

# A loop would go here to download all of the files.
# Pause between downloads, then download a file.
Sys.sleep(1)
download.file(paste0("http://data.gdeltproject.org/events/", filelist[1]), filelist[1])
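
The loop mentioned in the comments could look roughly like the sketch below. `download_all()` and its arguments (`dest_dir`, `pause`) are hypothetical names introduced here, not part of rvest or base R. It assumes the `href` values are file names relative to the base URL, and it passes `mode = "wb"` on the assumption that the files may be binary (for example, zipped).

```r
# Hypothetical helper: download each entry of `files` from `base_url` into `dest_dir`
download_all <- function(files, base_url, dest_dir = ".", pause = 1) {
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
  for (f in files) {
    dest <- file.path(dest_dir, basename(f))
    if (file.exists(dest)) next  # skip files that were already downloaded
    # try() keeps a single failed download from aborting the whole loop
    result <- try(download.file(paste0(base_url, f), dest, mode = "wb"),
                  silent = TRUE)
    if (inherits(result, "try-error")) message("Failed: ", f)
    Sys.sleep(pause)  # pause between requests to be polite to the server
  }
  invisible(NULL)
}

# Example call (commented out because it would download every 2018 file):
# download_all(filelist, "http://data.gdeltproject.org/events/", dest_dir = "gdelt_2018")
```

Skipping files that already exist makes the loop safe to re-run after an interruption.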
Dave2e