
I would like to scrape the historical weather data from this page http://www.weather.gov.sg/climate-historical-daily.

I am using the code from this question: Using r to navigate and scrape a webpage with drop down html forms.

However, I am not able to get the data, probably due to a change in the structure of the page. In the code from the link above, `pgform <- html_form(pgsession)[[3]]` was used to change the values of the form, but I was not able to find a similar form in my case.

library(rvest)

url <- "http://www.weather.gov.sg/climate-historical-daily"
pgsession <- html_session(url)  # start a browsing session on the page
pgsource <- read_html(url)
pgform <- html_form(pgsession)  # list every form found on the page

The result in my case:

> pgform
[[1]]
<form> 'searchform' (GET http://www.weather.gov.sg/)
<button submit> '<unnamed>
<input text> 's': 

  • That's just getting the search box, not the actual controls, which are not in a `<form>` tag and thus can't be handled by `html_form`. You'd probably need RSelenium. The page does have nice CSV download links, though, which seem to follow a pattern and could thus probably be downloaded directly with `download.file` provided you can figure out which ones you need. – alistaire Apr 26 '17 at 07:01
  • Thank you, and I agree with you that the page has download links. But I need the last 3 years of data, for all the stations listed in the dropdown. I thought if I could figure out this part, I could write a loop to get the data. – challa420 Apr 26 '17 at 07:08

1 Answer


Since the page has a CSV download button and the links it provides follow a pattern, you can generate and download a set of URLs. You'll need a set of the station IDs, which you can scrape from the dropdown itself:

library(rvest)

page <- 'http://www.weather.gov.sg/climate-historical-daily' %>% read_html()

station_id <- page %>% html_nodes('button#cityname + ul a') %>% 
    html_attr('onclick') %>%    # If you need names, grab the `href` attribute, too.
    sub(".*'(.*)'.*", '\\1', .)

The `station_id` vector can then be put into `expand.grid` with the months and years to generate all the necessary combinations (naming the station column so it can be referenced later):

df <- expand.grid(station_id = station_id, 
                  month = sprintf('%02d', 1:12),
                  year = 2014:2016)

(Note that if you want 2017 data, you'll need to construct those combinations separately and `rbind` them so as not to generate months that haven't happened yet.)
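
For instance, a minimal sketch of that separate construction (the cutoff month is an assumption; adjust it to however far into 2017 you need):

# Sketch: partial-year 2017 combinations, stacked onto df.
df_2017 <- expand.grid(station_id = station_id,
                       month = sprintf('%02d', 1:4),  # through April here
                       year = 2017)
df <- rbind(df, df_2017)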

The combinations can then be `paste0`ed into URLs:

urls <- paste0('http://www.weather.gov.sg/files/dailydata/DAILYDATA_', 
               df$station_id, '_', df$year, df$month, '.csv')

which can be `lapply`ed across to download all the files:

# Warning! This will download a lot of files! Make sure you're in a clean directory.    
lapply(urls, function(url){download.file(url, basename(url), method = 'curl')})
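
If some station and month combinations turn out not to exist, a failed request will stop the loop partway through. A minimal sketch of a more forgiving version (using the default download method, which signals an R error on HTTP failures, and assuming all of the CSVs share one column layout for the final stacking):

# Sketch: skip URLs that fail instead of aborting, then stack the results.
safe_download <- function(url) {
    tryCatch(download.file(url, basename(url)),
             error = function(e) message('Skipping ', url))
}
invisible(lapply(urls, safe_download))

files   <- list.files(pattern = '^DAILYDATA_.*\\.csv$')
weather <- do.call(rbind, lapply(files, read.csv, stringsAsFactors = FALSE))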