I am trying to read in an XML file from the web, located at: https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml

I am getting the following error in R:

Error: XML content does not seem to be XML: 'https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml'

My code:

install.packages("XML")
library(XML)
fileURL = "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml"
doc = xmlTreeParse(fileURL)

I want to read that XML file and find out how many restaurants have the zip code 21231.

Thanks

Shery
  • Check the documentation of that function. http://cran.r-project.org/web/packages/XML/XML.pdf I guess you need to populate the `isUrl` parameter properly. – hek2mgl Jun 10 '14 at 11:03
  • I tried this, but it didn't work. Errors: failed to load external entity "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml" Error: 1: failed to load external entity "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml" – Shery Jun 10 '14 at 11:10

1 Answer

Try downloading the XML file first and parsing the local copy:

library(XML)
fileURL <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml"
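# Download to a temporary file first, then parse the local copy
tf <- tempfile(fileext = ".xml")
download.file(fileURL, destfile = tf)
doc <- xmlParse(tf)
# Extract the text of every zipcode node and count the matches
zip <- xpathSApply(doc, "/response/row/row/zipcode", xmlValue)
sum(zip == "21231")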
# [1] 127
lukeA
  • Can you explain why? The documentation states that URLs are valid. – hek2mgl Jun 10 '14 at 11:17
  • They are valid, but I guess the certificate verification for https fails. I don't know whether you can pass `ssl.verifypeer = FALSE` to the underlying `RCurl::getURL`(?). However, `download.file`, `readLines`, `RCurl::getURL(..., ssl.verifypeer = FALSE)`, or even replacing `https` with `http` all work. – lukeA Jun 10 '14 at 11:38
  • Sounds reasonable. Unfortunately I cannot test it atm. – hek2mgl Jun 10 '14 at 11:40
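
The alternative mentioned in the comments (fetching the XML text yourself and handing it to the parser as a string) might look roughly like the sketch below. It is not from the answer itself: `count_zip` is a hypothetical helper name, the sample document is made up, and it only assumes the same `/response/row/row/zipcode` layout used in the answer. The network call is left commented out because it is untested here.

```r
library(XML)

# Hypothetical helper: parse XML held in a character string and count
# the <zipcode> nodes whose text equals `zip`.
count_zip <- function(xml_text, zip) {
  # asText = TRUE tells xmlParse that the input is XML content, not a path or URL
  doc <- xmlParse(xml_text, asText = TRUE)
  # unlist() keeps the comparison well defined even when no node matches
  zips <- unlist(xpathSApply(doc, "/response/row/row/zipcode", xmlValue))
  sum(zips == zip)
}

# Made-up sample that mimics the assumed structure of restaurants.xml
sample_xml <- '<response><row>
  <row><name>A</name><zipcode>21231</zipcode></row>
  <row><name>B</name><zipcode>21202</zipcode></row>
  <row><name>C</name><zipcode>21231</zipcode></row>
</row></response>'

count_zip(sample_xml, "21231")  # 2 for this made-up sample

# For the real file, the comment's suggestion would be something like
# (assumption, untested; note that it disables certificate verification):
# xml_text <- RCurl::getURL(fileURL, ssl.verifypeer = FALSE)
# count_zip(xml_text, "21231")
```

Keeping the fetch separate from the parse means the counting logic can be checked on a small string before any download is involved.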