
Objective: Using R, get latitude and longitude data for a vector of addresses through open.mapquestapi

Point of departure: Since geocode from the ggmap package is restricted to 2,500 queries a day, I needed to find a different way (my data.frame consists of 9M entries). The Data Science Toolkit is not an option, since most of my addresses are based outside the UK/US. I found this excellent snippet on http://rpubs.com/jvoorheis/Micro_Group_Rpres utilizing open.mapquestapi.

geocode_attempt <- function(address) {
    # Build the request URL; note the "&" before boundingBox, which the
    # original snippet was missing
    URL2 <- paste("http://open.mapquestapi.com/geocoding/v1/address?key=", "Fmjtd%7Cluub2huanl%2C20%3Do5-9uzwdz", 
        "&location=", address, "&outFormat='json'", "&boundingBox=24,-85,50,-125", 
        sep = "")
    # print(URL2)
    URL2 <- gsub(" ", "+", URL2)  # URL-encode spaces in the address
    x <- getURL(URL2)
    x1 <- fromJSON(x)
    if (length(x1$results[[1]]$locations) == 0) {
        return(NA)
    } else {
        return(c(x1$results[[1]]$locations[[1]]$displayLatLng$lat, x1$results[[1]]$locations[[1]]$displayLatLng$lng))
    }
}
geocode_attempt("1241 Kincaid St, Eugene,OR")

We need these libraries:

library(RCurl)
library(rjson)
library(dplyr)

Let's create a mock-up data.frame with 5 addresses.

id <- 1:5
street <- c("Alexanderplatz 10", "Friedrichstr 102", "Hauptstr 42", "Bruesseler Platz 2", "Aachener Str 324")
postcode <- c("10178", "10117", "31737", "50672", "50931")
city <- c(rep("Berlin", 2), "Rinteln", rep("Koeln", 2))
country <- rep("DE", 5)

df <- data.frame(id, street, postcode, city, country)

For adding latitude (lat) and longitude (lon) variables to the data.frame, we could work with a for loop. I will present the code just to demonstrate that the function works in principle.

for(i in 1:5){
  # Note: each address is geocoded twice here (once for lat, once for lon),
  # which doubles the number of API calls
  df$lat[i] <- geocode_attempt(paste(df$street[i], df$postcode[i], df$city[i], df$country[i], sep=","))[1]
  df$lon[i] <- geocode_attempt(paste(df$street[i], df$postcode[i], df$city[i], df$country[i], sep=","))[2]
}

From a performance standpoint, this code is pretty bad. Even for this small data.frame, my computer took about 9 seconds, most probably due to the web-service queries, but never mind. So I could run this code on my 9M rows, but the runtime would be enormous.
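How the timing was measured isn't shown above; a minimal way to reproduce it is to wrap the loop in system.time(), which is also what produces the figure quoted further down. The elapsed time is dominated by the HTTP round trips:

# Reproduce the timing; "elapsed" is the wall-clock time, dominated by
# the web-service round trips (two per row in this loop)
system.time(
  for (i in 1:5) {
    df$lat[i] <- geocode_attempt(paste(df$street[i], df$postcode[i], df$city[i], df$country[i], sep = ","))[1]
    df$lon[i] <- geocode_attempt(paste(df$street[i], df$postcode[i], df$city[i], df$country[i], sep = ","))[2]
  }
)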

My attempt was to utilize the mutate function from the dplyr package. Here is what I tried:

df %>%
  mutate(lat = geocode_attempt(paste(street, postcode, city, country, sep=","))[1],
         lon = geocode_attempt(paste(street, postcode, city, country, sep=","))[2])

system.time reports only 2.3 seconds. Not too bad. But here is the problem:

 id             street postcode    city country      lat      lon
1  1  Alexanderplatz 10    10178  Berlin      DE 52.52194 13.41348
2  2   Friedrichstr 102    10117  Berlin      DE 52.52194 13.41348
3  3        Hauptstr 42    31737 Rinteln      DE 52.52194 13.41348
4  4 Bruesseler Platz 2    50672   Koeln      DE 52.52194 13.41348
5  5   Aachener Str 324    50931   Koeln      DE 52.52194 13.41348

lat and lon are exactly the same for all entries. My understanding was that mutate works row-wise, but here lat and lon are the values calculated for the first row; accordingly, only the first row is correct. Does anyone have an idea why? The code I provided is complete; nothing extra is loaded. If you have a performant alternative rather than an optimization of my code, I would be grateful as well.

Daniel Schultz
  • How did the query provided by @NicE end up working for your 9M rows? Were you able to geocode all instances in a relatively small amount of time, or did you hit a limitation with MapQuest? – bshelt141 May 12 '17 at 11:57

3 Answers


You might need to vectorize your geocode_attempt function so it works column-wise:

vecGeoCode <- Vectorize(geocode_attempt, vectorize.args = "address")

And then call:

df %>%
  mutate(lat = vecGeoCode(paste(street, postcode, city, country, sep = ","))[1, ],
         lon = vecGeoCode(paste(street, postcode, city, country, sep = ","))[2, ])
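A side note (my addition, not part of the original answer): Vectorize() is built on mapply(), so when every lookup succeeds the results simplify to a 2 x n matrix, latitudes in row 1 and longitudes in row 2; that is what the [1, ] and [2, ] indexing above relies on:

# Quick shape check (assumes every address geocodes successfully)
addresses <- paste(df$street, df$postcode, df$city, df$country, sep = ",")
coords <- vecGeoCode(addresses)
dim(coords)  # 2 x 5: row 1 = latitudes, row 2 = longitudes
# Caveat: if any lookup returns a single NA, mapply() cannot simplify
# the results to a matrix and returns a list instead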

To speed things up, you might want to look at the batch mode of the API, which returns up to 100 latitudes and longitudes in one go.

To use the API's batch requests you could use this function:

geocodeBatch_attempt <- function(address) {
  # URL for batch requests: repeat "&location=" once per address
  URL <- paste("http://open.mapquestapi.com/geocoding/v1/batch?key=", "Fmjtd%7Cluub2huanl%2C20%3Do5-9uzwdz", 
               "&location=", paste(address, collapse = "&location="), sep = "") 

  URL <- gsub(" ", "+", URL)  # URL-encode spaces
  data <- getURL(URL)
  data <- fromJSON(data)

  # Extract lat/lng for each result; c(NA, NA) for addresses with no match
  p <- sapply(data$results, function(x) {
    if (length(x$locations) == 0) {
      c(NA, NA)
    } else {
      c(x$locations[[1]]$displayLatLng$lat, x$locations[[1]]$displayLatLng$lng)   
    }
  })
  # Transpose to an n x 2 matrix: column 1 = lat, column 2 = lng
  return(t(p))
}

To test it:

# make a bigger df from the data (repeat the 5 rows 25 times)
biggerDf <- df[rep(row.names(df), 25), ]

# add a reqId column to split the data into batches of at most 100 requests
biggerDf$reqId <- seq_along(biggerDf$id) %/% 100

# run the function, first grouping by reqId to send batches of up to 100 requests
biggerDf %>%
  group_by(reqId) %>%
  mutate(lat = geocodeBatch_attempt(paste(street, postcode, city, country, sep = ","))[, 1],
         lon = geocodeBatch_attempt(paste(street, postcode, city, country, sep = ","))[, 2])
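A caveat worth adding (my note, not from the answer): as written, each batch is sent to the API twice, once for lat and once for lon. A sketch of querying each batch only once, using dplyr's do(), the idiomatic per-group tool in dplyr at the time:

# One API call per batch: compute the coordinate matrix once per group
biggerDf %>%
  group_by(reqId) %>%
  do({
    coords <- geocodeBatch_attempt(paste(.$street, .$postcode, .$city, .$country, sep = ","))
    mutate(., lat = coords[, 1], lon = coords[, 2])
  })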
NicE
  • How would I need to change the function? I guess just changing URL2 won't work :) Vectorizing the function works and is slightly faster than the `group_by` and `rowwise` options – Daniel Schultz Feb 24 '15 at 09:45
  • Yes, I edited my answer to add a modified version of geocode_attempt that handles batch requests. On 125 rows, it's ~twice as fast. – NicE Feb 24 '15 at 12:38
  • Works like a charm... since it is an API call it is not very fast, but it works. Does anyone have an idea what the query limit for MapQuest is? – Daniel Schultz Feb 25 '15 at 09:40

It's really easy to look at mutate() and draw the conclusion that what's happening is similar to what you illustrate in your for loop, but what you're actually seeing there is just a vectorized R function acting on the entire column of the data frame.

I would not be surprised if others had this misconception: the dplyr tutorials don't address the distinction between vectorized and non-vectorized functions, and (even more dangerous) R's recycling rules mean that applying a scalar function won't necessarily raise an error, as the toy example below shows. There's some more discussion of this here.
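To make the pitfall concrete, here is a minimal toy illustration (my own example, not from the original post): a scalar function applied in mutate() silently recycles its length-1 result down the whole column instead of raising an error.

library(dplyr)

# A scalar function: it only ever inspects the first element
first_char <- function(x) substr(x[1], 1, 1)

data.frame(city = c("Berlin", "Rinteln", "Koeln"), stringsAsFactors = FALSE) %>%
  mutate(initial = first_char(city))
# 'initial' is "B" in every row: the length-1 result is recycled,
# exactly like the identical lat/lon values in the question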

One option is to rewrite your geocode_attempt so that it can take a vector of addresses, for example:
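A minimal sketch of such a rewrite (illustrative only, assuming geocode_attempt as defined in the question; not code from the original answer):

# Wrap the scalar function with sapply() so it accepts a character vector;
# failed lookups are padded to c(NA, NA) so the result always simplifies
# to a 2 x n matrix (latitudes in row 1, longitudes in row 2)
geocode_vec <- function(addresses) {
  sapply(addresses, function(a) {
    res <- geocode_attempt(a)
    if (length(res) < 2) c(NA_real_, NA_real_) else res
  }, USE.NAMES = FALSE)
}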

If you want to keep your function as is, but want dplyr to behave more like something from the -ply family you have two potential approaches:

The first is to use the grouping variable you have in your data:

df %>%
  group_by(id) %>%
  mutate(
    lat = geocode_attempt(paste(street, postcode, city, country, sep=","))[1],
    lon = geocode_attempt(paste(street, postcode, city, country, sep=","))[2])

The second is to use the rowwise() function described in this answer.

df %>%
  rowwise() %>%
  mutate(
    lat = geocode_attempt(paste(street, postcode, city, country, sep=","))[1],
    lon = geocode_attempt(paste(street, postcode, city, country, sep=","))[2])

The group_by solution is significantly faster on my machine. Not sure why!

Unfortunately, the speed savings you are seeing from dplyr above are likely somewhat illusory, most likely the result of the geocoding function getting called only once (vs. once per row in the loop). There may well be gains, but you'll need to run the timings again.
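A sketch of how such a re-timing could look, reusing system.time() from the question (illustrative only):

# Compare wall-clock time of the two approaches on the 5-row df
system.time(
  df %>%
    rowwise() %>%
    mutate(lat = geocode_attempt(paste(street, postcode, city, country, sep = ","))[1],
           lon = geocode_attempt(paste(street, postcode, city, country, sep = ","))[2])
)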

Andrew

There's a geocoding package, geocodeHERE, that uses the Nokia HERE service. It has a batch mode. You can use it with the test API keys, and you may not hit a limit. Worth a look...

cory
  • What would be the batch limit for geocodeHERE? Am I correct when I assume that you are the developer? – Daniel Schultz Feb 24 '15 at 09:56
  • Limit is 10K, but they allow the use of their default keys seemingly without limit. Yes, it's just a simple wrapper for the API. The batch functionality is kinda nice though. – cory Feb 24 '15 at 15:20