0

Normally I don't have any issues getting table data from sites, but this one is throwing me for a loop.

I've tried various suggestion from the site: [R: Scraping Site, Incrementing Loop by Date in URL, Saving To CSV [Scraping from aspx website using R [web scraping in R

I've tried the two methods to try and get something from the site and end up with errors.

The first approach:

#####Reading in data
library(RCurl)
library(XML)
library(xts)

#pulling rainfall data csv
direct_rainfall <- read.csv(url(getURL("http://cdec.water.ca.gov  /cgi-progs/getMonthlyCSV?station_id=CVT&dur_code=M&sensor_num=2&start_date=1/1/2000&end_date=now")))

This ends with the following error: Error in function (type, msg, asError = TRUE) : Failed to connect to cdec.water.ca.gov port 80: Timed out

The second method:

#xml data pull method
require(XML)

url = "http://cdec.water.ca.gov/cgi-progs/getMonthlyCSV?station_id=CVT&dur_code=M&sensor_num=2&start_date=1/1/2000&end_date=now"
doc = htmlParse(url)

Which end with the following error: Error: failed to load external entity "http://cdec.water.ca.gov/cgi-progs/getMonthlyCSV?station_id=CVT&dur_code=M&sensor_num=2&start_date=1/1/2000&end_date=now"

Any guidance would be appreciated. I just can't figure out why I'm getting nothing when I try and pull from the URL.

Thanks!

Tyler S
  • 13
  • 2
  • 1
    It's not HTML, it's a CSV. Just use `read.csv('http://cdec.water.ca.gov/cgi-progs/getMonthlyCSV?station_id=CVT&dur_code=M&sensor_num=2&start_date=1/1/2000&end_date=now', skip = 3, nrows = 17)` – alistaire Jul 26 '16 at 20:35
  • Expletive! That worked. I suck. Thanks! I was definitely over complicating it. – Tyler S Jul 26 '16 at 20:49

1 Answers1

2

If you look at the website, it's a reasonably nicely formatted CSV. Happily, if you pass read.csv a URL, it will automatically handle the connection for you, so all you really need is:

url <- 'http://cdec.water.ca.gov/cgi-progs/getMonthlyCSV?station_id=CVT&dur_code=M&sensor_num=2&start_date=1/1/2000&end_date=now'

df <- read.csv(url, skip = 3, nrows = 17, na.strings = 'm')

df[1:5,1:10]
##   X.station. X.sensor. X.year. X.month.   X01   X02  X03   X04  X05  X06
## 1        CVT         2    2000       NA 20.90 19.44 3.74  3.31 5.02 0.85
## 2        CVT         2    2001       NA  7.23  9.53 3.86  7.47 0.00 0.15
## 3        CVT         2    2002       NA  3.60  4.43 8.71  2.76 2.78 0.00
## 4        CVT         2    2003       NA  1.71  4.34 4.45 13.45 2.95 0.00
## 5        CVT         2    2004       NA  3.41 10.57 1.80  0.87 0.90 0.00
alistaire
  • 42,459
  • 4
  • 77
  • 117