
How can I download the content of a webpage, find all files with a specific extension listed on it, and then download all of them? For example, I would like to download all netcdf files (with extension *.nc4) from the following webpage: https://data.giss.nasa.gov/impacts/agmipcf/agmerra/.

I was recommended to look into the RCurl package but could not work out how to do this.

89_Simple
  • Have you tried to write any code? Please read [How to create a Minimal, Complete, and Verifiable Example](https://stackoverflow.com/help/mcve) and update your post. For example, looking at the web page, your question might be about scraping the page to retrieve all the file names, or about how to download a set of files, or both, or neither. A verifiable example will make it easier for people to help you. – Len Greski May 03 '18 at 22:37
  • Here is the [reference manual](https://cran.r-project.org/web/packages/RCurl/RCurl.pdf) and a [SO post](https://stackoverflow.com/questions/23028760/download-a-file-from-https-using-download-file) you may want to look at. Let us know after you've tried some of the code and whether it still doesn't work. – Kim May 03 '18 at 22:40
  • Thank you, those are useful references. I will have a read. – 89_Simple May 04 '18 at 09:22

1 Answer

library(stringr)

# Get the content of the page
thepage <- readLines('https://data.giss.nasa.gov/impacts/agmipcf/agmerra/')

# Find the lines that mention netcdf files (escape the dot so it matches literally)
nc4.lines <- grep('\\.nc4', thepage)

# Subset the original vector, leaving only those lines
thepage <- thepage[nc4.lines]

# Extract the file names: match from the leading 'A' up to nc4 and the closing quote
str.loc <- str_locate(thepage, 'A.*nc4?"')

# Substring, dropping the trailing quote
file.list <- substring(thepage, str.loc[,1], str.loc[,2]-1)

# Download all files (mode = "wb" keeps binary netcdf files intact on Windows)
for (ifile in file.list) {
  download.file(paste0("https://data.giss.nasa.gov/impacts/agmipcf/agmerra/",
                       ifile),
                destfile = ifile, method = "libcurl", mode = "wb")
}
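Line-by-line string matching can break if the HTML layout changes. A minimal sketch of a more robust variant, assuming the rvest package is installed (its `read_html`, `html_elements`, and `html_attr` helpers parse the page as HTML and pull link targets directly):

```r
library(rvest)

base.url <- "https://data.giss.nasa.gov/impacts/agmipcf/agmerra/"
page     <- read_html(base.url)

# Collect the href attribute of every <a> link, then keep the .nc4 ones
links    <- html_attr(html_elements(page, "a"), "href")
nc4.list <- links[grepl("\\.nc4$", links)]

# mode = "wb" avoids corrupting binary netcdf files on Windows
for (ifile in nc4.list) {
  download.file(paste0(base.url, ifile), destfile = ifile, mode = "wb")
}
```

This skips the regex/substring bookkeeping entirely, since the parser hands back clean file names.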
Katia
  • I get the following error message `trying URL 'https://data.giss.nasa.gov/impacts/agmipcf/agmerra/AgMERRA_2010_wndspd.nc' Error in download.file(paste0("https://data.giss.nasa.gov/impacts/agmipcf/agmerra/", : cannot open URL 'https://data.giss.nasa.gov/impacts/agmipcf/agmerra/AgMERRA_2010_wndspd.nc' In addition: Warning message: In download.file(paste0("https://data.giss.nasa.gov/impacts/agmipcf/agmerra/", : URL 'https://data.giss.nasa.gov/impacts/agmipcf/agmerra/AgMERRA_2010_wndspd.nc': status was 'SSL connect error'` – 89_Simple May 04 '18 at 09:25
  • Can you try `file.list <- substring(thepage, str.loc[,1], str.loc[,2]-1)` instead? – Katia May 04 '18 at 09:36
  • Hmm, I still have this issue. But I get the gist of how to use this function to download files from the internet, so I will accept this answer. – 89_Simple May 04 '18 at 12:00
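The 'SSL connect error' reported in the comments can sometimes be worked around by streaming the download through the httr package instead of `download.file`. A hedged sketch, assuming httr is installed (the URL is the one from the error message above):

```r
library(httr)

url <- "https://data.giss.nasa.gov/impacts/agmipcf/agmerra/AgMERRA_2010_wndspd.nc4"

# write_disk() streams the response body straight to a local file,
# avoiding loading a large netcdf file into memory
resp <- GET(url, write_disk(basename(url), overwrite = TRUE))

# Raise an R error if the server returned a non-2xx status
stop_for_status(resp)
```

If the error persists, it may be an outdated curl/OpenSSL on the client side rather than anything in the R code.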