I am after the names and coordinates of the stores from: http://contact.woolworths.com.au/storelocator/service/proximity/supermarkets/latitude/-37.7510/longitude/144.8981/range/50/max/200.xml

e.g.
<name>Niddrie</name>
<latitude>-37.737332</latitude>
<longtitude>144.892342</longtitude>

How would I do this? I have tried these:

library(XML)
library(methods)
library(xml2)

#https://stackoverflow.com/questions/17198658/how-to-parse-xml-to-r-data-frame 
data <- xmlParse("http://contact.woolworths.com.au/storelocator/service/proximity/supermarkets/latitude/-37.7510/longitude/144.8981/range/50/max/200.xml")
xml_data <- xmlToList(data)
location <- as.list(xml_data[["storeList"]][["storeRank"]][["storeDetail"]][["Name"]])

#https://www.datacamp.com/community/tutorials/r-data-import-tutorial#xml - not working
xmlfile <- xmlTreeParse("http://contact.woolworths.com.au/storelocator/service/proximity/supermarkets/latitude/-37.7510/longitude/144.8981/range/50/max/200.xml")
class(xmlfile)
topxml <- xmlRoot(xmlfile)
topxml <- xmlSApply(topxml,
                    function(x) xmlSApply(x, xmlValue))
xml_df <- data.frame(t(topxml),
                     row.names=NULL) 

Neither attempt throws an error, but neither returns the store names I want. The same goes for the coordinates.

william3031
  • Point 21 (h) on https://www.woolworths.com.au/Shop/Discover/about-us/terms-and-conditions suggests this is not ethical or legal to do, despite there being no technical control in https://www.woolworths.com.au/robots.txt, unless you're using the API https://developer.woolworths.com.au/ (can you confirm that you're using the API vs scraping?) – hrbrmstr Oct 19 '18 at 13:53
  • I am scraping. Isn't that what people like us do? – william3031 Oct 19 '18 at 14:03
  • I can't see a way to get that XML from scraping (only the API seems to have a "service" that returns XML). And just because you _can_ scrape does not make it ethical or legal to do so. – hrbrmstr Oct 19 '18 at 14:04
  • Noted. Thanks for the advice. – william3031 Oct 19 '18 at 14:06

1 Answer

It looks like that XML can only come from the API. The document declares a default namespace, so unprefixed XPath queries will not match its elements; that's likely what's causing your problems. We'll just strip it.

library(xml2)

# Read the document, then strip the default namespace so plain XPath works
doc <- read_xml("http://contact.woolworths.com.au/storelocator/service/proximity/supermarkets/latitude/-37.7510/longitude/144.8981/range/50/max/200.xml")
doc <- xml_ns_strip(doc)

data.frame(
  name = xml_text(xml_find_all(doc, ".//storeDetail/name")),
  lng = xml_double(xml_find_all(doc, ".//storeDetail/longtitude")),
  lat = xml_double(xml_find_all(doc, ".//storeDetail/latitude")),
  stringsAsFactors = FALSE
) -> stores

str(stores)
## 'data.frame': 188 obs. of  3 variables:
##  $ name: chr  "Niddrie" "Highpoint West" "Moonee Ponds" "East Keilor" ...
##  $ lng : num  145 145 145 145 145 ...
##  $ lat : num  -37.7 -37.8 -37.8 -37.7 -37.7 ...
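
To see why the namespace matters without hitting the service, here is a small sketch on inline XML. The element names follow the question, but the namespace URI and the second store are made up for illustration, so treat this as a stand-in for the real feed rather than a copy of it.

```r
library(xml2)

# Made-up stand-in for the feed: element names follow the question;
# the namespace URI and "Example Store" are hypothetical.
txt <- '<storeList xmlns="http://example.com/stores">
  <storeRank><storeDetail>
    <name>Niddrie</name>
    <latitude>-37.737332</latitude>
    <longtitude>144.892342</longtitude>
  </storeDetail></storeRank>
  <storeRank><storeDetail>
    <name>Example Store</name>
    <latitude>-37.75</latitude>
    <longtitude>144.9</longtitude>
  </storeDetail></storeRank>
</storeList>'

doc <- read_xml(txt)

# With the default namespace in place, an unprefixed XPath matches nothing
n_before <- length(xml_find_all(doc, ".//storeDetail/name"))

# xml_ns_strip() modifies the document in place; the same XPath now matches
xml_ns_strip(doc)
n_after <- length(xml_find_all(doc, ".//storeDetail/name"))

c(before = n_before, after = n_after)
```

The empty result before stripping is the same "no error, but no data" symptom described in the question.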

For those still using the XML package:

library(XML)

doc <- xmlParse("http://contact.woolworths.com.au/storelocator/service/proximity/supermarkets/latitude/-37.7510/longitude/144.8981/range/50/max/200.xml")

def <- c(d = getDefaultNamespace(doc)[[1]]$uri)

data.frame(
  name = xpathSApply(doc, "//d:storeDetail/d:name", xmlValue, namespaces = def),
  lng = as.numeric(xpathSApply(doc, "//d:storeDetail/d:longtitude", xmlValue, namespaces = def)),
  lat = as.numeric(xpathSApply(doc, "//d:storeDetail/d:latitude", xmlValue, namespaces = def)),
  stringsAsFactors = FALSE
) -> stores
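
If you would rather keep the namespace in xml2 as well, a rough equivalent is to pass `xml_ns(doc)` to the query and use the `d1` prefix that xml2 assigns to the first default namespace. The sketch below runs on inline XML with a made-up namespace URI; only the element names and the Niddrie values come from the question.

```r
library(xml2)

# Hypothetical one-store document; the namespace URI is made up.
txt <- '<storeList xmlns="http://example.com/stores">
  <storeRank><storeDetail>
    <name>Niddrie</name>
    <latitude>-37.737332</latitude>
    <longtitude>144.892342</longtitude>
  </storeDetail></storeRank>
</storeList>'

doc <- read_xml(txt)

# xml2 exposes the default namespace under the generated prefix "d1"
ns <- xml_ns(doc)

stores <- data.frame(
  name = xml_text(xml_find_all(doc, ".//d1:storeDetail/d1:name", ns)),
  lng  = xml_double(xml_find_all(doc, ".//d1:storeDetail/d1:longtitude", ns)),
  lat  = xml_double(xml_find_all(doc, ".//d1:storeDetail/d1:latitude", ns)),
  stringsAsFactors = FALSE
)

str(stores)
```

Note that `longtitude` is the spelling used in the sample output in the question, so the XPath has to match it.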
hrbrmstr