Objective: Using R, get latitude and longitude data for a vector of addresses through open.mapquestapi
Point of departure: Since geocode from the ggmap package is restricted to 2500 queries a day, I needed to find a different way (my data.frame consists of 9M entries). The Data Science Toolkit is not an option since most of my addresses are outside the UK/US. I found this excellent snippet on http://rpubs.com/jvoorheis/Micro_Group_Rpres that uses open.mapquestapi.
geocode_attempt <- function(address) {
  URL2 = paste("http://open.mapquestapi.com/geocoding/v1/address?key=", "Fmjtd%7Cluub2huanl%2C20%3Do5-9uzwdz",
               "&location=", address, "&outFormat='json'", "boundingBox=24,-85,50,-125",
               sep = "")
  # print(URL2)
  URL2 <- gsub(" ", "+", URL2)
  x = getURL(URL2)
  x1 <- fromJSON(x)
  if (length(x1$results[[1]]$locations) == 0) {
    return(NA)
  } else {
    return(c(x1$results[[1]]$locations[[1]]$displayLatLng$lat, x1$results[[1]]$locations[[1]]$displayLatLng$lng))
  }
}
geocode_attempt("1241 Kincaid St, Eugene,OR")
We need these libraries:
library(RCurl)
library(rjson)
library(dplyr)
Let's create a mock-up data.frame with 5 addresses.
id <- c(seq(1:5))
street <- c("Alexanderplatz 10", "Friedrichstr 102", "Hauptstr 42", "Bruesseler Platz 2", "Aachener Str 324")
postcode <- c("10178","10117", "31737", "50672", "50931")
city <- c(rep("Berlin", 2), "Rinteln", rep("Koeln",2))
country <- c(rep("DE", 5))
df <- data.frame(id, street, postcode, city, country)
To add a latitude (lat) and a longitude (lon) variable to the data.frame, we could use a for loop. I will present the code just to demonstrate that the function works in principle.
for(i in 1:5){
  df$lat[i] <- geocode_attempt(paste(df$street[i], df$postcode[i], df$city[i], df$country[i], sep=","))[1]
  df$lon[i] <- geocode_attempt(paste(df$street[i], df$postcode[i], df$city[i], df$country[i], sep=","))[2]
}
From a performance standpoint, this code is pretty bad. Even for this small data.frame, my computer took about 9 seconds, most probably because of the web service queries (two per row). So I could run this code on my 9M rows, but the runtime would be enormous.
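As a side note, the loop above queries the service twice per row, once for lat and once for lon. A minimal sketch of calling the lookup only once per row and reusing the result follows. To keep it runnable offline, fake_geocode is a hypothetical stand-in for geocode_attempt that returns made-up coordinates and counts its calls; demo_df is likewise only an illustration.

```r
# Hypothetical stand-in for geocode_attempt() so the sketch runs offline;
# it counts how often it is called instead of querying the web service.
calls <- 0
fake_geocode <- function(address) {
  calls <<- calls + 1
  c(52.5, 13.4)  # made-up coordinates, not real geocoding output
}

demo_df <- data.frame(street = c("Alexanderplatz 10", "Hauptstr 42"),
                      city   = c("Berlin", "Rinteln"))

# Create both columns up front so the loop only fills in values.
demo_df$lat <- NA_real_
demo_df$lon <- NA_real_

for (i in seq_len(nrow(demo_df))) {
  # One lookup per row; the result is reused for both columns.
  res <- fake_geocode(paste(demo_df$street[i], demo_df$city[i], sep = ","))
  demo_df$lat[i] <- res[1]
  demo_df$lon[i] <- res[2]
}

calls  # number of lookups made, one per row
```

With the real function this would roughly halve the number of requests, although the per-request latency of the web service would presumably still dominate.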
My attempt was to use the mutate function from the dplyr package.
Here is what I tried:
df %>%
  mutate(lat = geocode_attempt(paste(street, postcode, city, country, sep=","))[1],
         lon = geocode_attempt(paste(street, postcode, city, country, sep=","))[2])
system.time reports only 2.3 seconds. Not too bad. But here is the problem:
  id             street postcode    city country      lat      lon
1  1  Alexanderplatz 10    10178  Berlin      DE 52.52194 13.41348
2  2   Friedrichstr 102    10117  Berlin      DE 52.52194 13.41348
3  3        Hauptstr 42    31737 Rinteln      DE 52.52194 13.41348
4  4 Bruesseler Platz 2    50672   Koeln      DE 52.52194 13.41348
5  5   Aachener Str 324    50931   Koeln      DE 52.52194 13.41348
lat and lon are exactly the same for all entries. As I understand it, the mutate function works row-wise. But here, lat and lon are the values calculated from the first row, so only the first row is correct. Does anyone have an idea why? The code I provided is complete; nothing extra is loaded. If you know a faster alternative rather than an optimization of my code, I would be grateful for that as well.
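One way to probe this is the sketch below, which runs offline. mutate evaluates each expression once on whole columns rather than once per row, so the function receives all addresses in a single call. The stand-in fake_geocode is hypothetical: it assumes that the lookup effectively uses only the first address it is given and returns one c(lat, lon) pair. That assumption would match the output above, but it has not been verified against geocode_attempt itself.

```r
# Hypothetical stand-in for geocode_attempt(): no network access.
# Assumption: however many addresses it receives, only the first one is used
# and a single c(lat, lon) pair comes back.
fake_geocode <- function(address) {
  n <- as.numeric(nchar(address[1]))
  c(n, -n)  # made-up "coordinates" derived from the text length
}

addresses <- c("Alexanderplatz 10,10178,Berlin,DE",
               "Hauptstr 42,31737,Rinteln,DE",
               "Aachener Str 324,50931,Koeln,DE")

# What a column-wise evaluation amounts to: one call on the whole vector,
# whose single value would then be recycled to every row.
whole_column <- fake_geocode(addresses)[1]

# One call per address instead; vapply() also checks that every result
# is a numeric vector of length 2.
per_row <- vapply(addresses, fake_geocode, numeric(2), USE.NAMES = FALSE)
lat <- per_row[1, ]
lon <- per_row[2, ]
```

Here whole_column has length 1, while lat and lon have one value per address. If the same pattern holds for the real function, applying it element by element (for example with vapply, or after dplyr's rowwise) should give per-row coordinates, at the cost of one request per address.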