
Edit: It looks like this is a known issue with the "cascade" method. When the first attempt returns only NA values, the lat/long columns end up typed as logical, and the numeric (double) lat/lons returned by subsequent methods cannot be assigned into them.

Data: I have a list of addresses that I need to geocode. I'm using lapply() to split-apply-combine, which works, but very slowly. When I try splitting the data further (by year-month instead of by year) before applying and combining, I get errors about type conversion that are confusing to me.

# example data 
library(dplyr)
library(tidygeocoder)

url <- "https://www.briandunning.com/sample-data/us-500.zip"
download.file(url = url, destfile = basename(url))

adds <- readr::read_csv(basename(url)) %>%
  select(address, city, 
         county, state, zip) %>%
  mutate(date = seq.Date(as.Date('2015-01-01'), to = Sys.Date(), length.out = 500)) %>%
  mutate(year = lubridate::year(date)) %>%
  # to keep it small 
  sample_n(20)

This works: split the addresses by year, apply the tidygeocoder function to return lat/lons, and recombine.

adds_by_year <- adds %>% split(.$year)
geo_list <- lapply(adds_by_year, function(x) {
  geo <-  geocode(.tbl = x,
                      street = address,
                      city = city,
                      county = county,
                      state = state,
                      postalcode = zip,
                      # cascade method uses all options (census, osm, etc)
                      # takes longer but may be more accurate
                      method = "cascade", timeout = 500) %>%
    filter(!is.na(lat))
  return(geo)
})

out <- bind_rows(geo_list)

Below does not:

adds <- adds %>%
  mutate(yrmn = zoo::as.yearmon(date))

adds_by_yrm <- adds %>% split(.$yrmn)
geo_list <- lapply(adds_by_yrm, function(x) {
  geo <-  geocode(.tbl = x,
                  street = address,
                  city = city,
                  county = county,
                  state = state,
                  postalcode = zip,
                  # cascade method uses all options (census, osm, etc)
                  # takes longer but may be more accurate
                  method = "cascade", timeout = 500) %>%
    filter(!is.na(lat))
  return(geo)
})

out <- bind_rows(geo_list)

Returns this error:

 Error: Assigned data `retry_results` must be compatible with existing data.
ℹ Error occurred for column `lat`.
x Can't convert from <double> to <logical> due to loss of precision.
* Locations: 1.
Run `rlang::last_error()` to see where the error occurred.
 

I did some searching and found this, but the proposed solution (wrapping x in as.data.frame()) resulted in the same error. Any insight is appreciated. I've looked into using purrr, but I'm not sure I fully grok it.

Here is the full backtrace, which I'm not familiar enough with to parse completely:

Backtrace:
     █
  1. ├─base::lapply(...)
  2. │ └─global::FUN(X[[i]], ...)
  3. │   └─tidygeocoder::geocode(...)
  4. │     ├─base::do.call(geo, geo_args)
  5. │     └─(function (address = NULL, street = NULL, city = NULL, county = NULL, ...
  6. │       ├─base::do.call(geo_cascade, all_args[!names(all_args) %in% c("method")])
  7. │       └─(function (..., cascade_order = c("census", "osm")) ...
  8. │         ├─base::`[<-`(...)
  9. │         └─tibble:::`[<-.tbl_df`(...)
 10. │           └─tibble:::tbl_subassign(x, i, j, value, i_arg, j_arg, substitute(value))
 11. │             └─tibble:::tbl_subassign_row(x, i, value, value_arg)
 12. │               ├─base::withCallingHandlers(...)
 13. │               └─vctrs::`vec_slice<-`(`*tmp*`, i, value = value[[j]])
 14. │                 └─(function () ...
 15. │                   └─vctrs:::vec_cast.logical.double(...)
 16. │                     └─vctrs::maybe_lossy_cast(out, x, to, lossy, x_arg = x_arg, to_arg = to_arg)
 17. │                       ├─base::withRestarts(...)
 18. │                       │ └─base:::withOneRestart(expr, restarts[[1L]])
 19. │                       │   └─base:::doWithOneRestart(return(expr), restart)
 20. │                       └─vctrs:::stop_lossy_cast(...)
 21. │                         └─vctrs:::stop_vctrs(...)
 22. │                           └─rlang::abort(message, class = c(class, "vctrs_error"), ...)
 23. │                             └─rlang:::signal_abort(cnd)
 24. │                               └─base::signalCondition(cnd)
 25. └─(function (cnd) ...
Francisco

2 Answers


It works for me with dplyr 1.0.6:

dplyr::bind_rows(geo_list)
# A tibble: 8 x 11
  address             city       county               state zip   date        year yrmn        lat  long geo_method
  <chr>               <chr>      <chr>                <chr> <chr> <date>     <dbl> <yearmon> <dbl> <dbl> <chr>     
1 134 Lewis Rd        Nashville  Davidson             TN    37211 2016-11-06  2016 Nov 2016   36.2 -86.8 osm       
2 6651 Municipal Rd   Houma      Terrebonne           LA    70360 2017-02-03  2017 Feb 2017   29.6 -90.7 osm       
3 189 Village Park Rd Crestview  Okaloosa             FL    32536 2017-08-25  2017 Aug 2017   30.8 -86.6 osm       
4 9122 Carpenter Ave  New Haven  New Haven            CT    06511 2018-01-14  2018 Jan 2018   41.5 -72.8 osm       
5 5221 Bear Valley Rd Nashville  Davidson             TN    37211 2018-09-17  2018 Sep 2018   36.1 -86.8 osm       
6 28 S 7th St #2824   Englewood  Bergen               NJ    07631 2020-03-31  2020 Mar 2020   40.9 -74.0 census    
7 5 E Truman Rd       Abilene    Taylor               TX    79602 2021-02-25  2021 Feb 2021   32.5 -99.7 osm       
8 9 Front St          Washington District of Columbia DC    20001 2021-05-16  2021 May 2021   38.9 -77.0 osm   

I noticed that some list elements have 0 rows. We could remove those 0-row elements and then use bind_rows:

library(purrr)
library(dplyr)
geo_list %>%
    keep(~ NROW(.x) > 0) %>% 
    bind_rows
# A tibble: 8 x 11
  address             city       county               state zip   date        year yrmn        lat  long geo_method
  <chr>               <chr>      <chr>                <chr> <chr> <date>     <dbl> <yearmon> <dbl> <dbl> <chr>     
1 134 Lewis Rd        Nashville  Davidson             TN    37211 2016-11-06  2016 Nov 2016   36.2 -86.8 osm       
2 6651 Municipal Rd   Houma      Terrebonne           LA    70360 2017-02-03  2017 Feb 2017   29.6 -90.7 osm       
3 189 Village Park Rd Crestview  Okaloosa             FL    32536 2017-08-25  2017 Aug 2017   30.8 -86.6 osm       
4 9122 Carpenter Ave  New Haven  New Haven            CT    06511 2018-01-14  2018 Jan 2018   41.5 -72.8 osm       
5 5221 Bear Valley Rd Nashville  Davidson             TN    37211 2018-09-17  2018 Sep 2018   36.1 -86.8 osm       
6 28 S 7th St #2824   Englewood  Bergen               NJ    07631 2020-03-31  2020 Mar 2020   40.9 -74.0 census    
7 5 E Truman Rd       Abilene    Taylor               TX    79602 2021-02-25  2021 Feb 2021   32.5 -99.7 osm       
8 9 Front St          Washington District of Columbia DC    20001 2021-05-16  2021 May 2021   38.9 -77.0 osm       
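
For readers less familiar with purrr, the same split-apply-combine pattern can be sketched without calling a geocoding service. This is only an illustration under stated assumptions: `fake_geocode()` is a hypothetical stand-in for `geocode()`, and the toy data is made up.

```r
library(dplyr)
library(purrr)

# Toy data, made up for illustration only
toy <- tibble::tibble(
  address = c("a", "b", "c", "d"),
  grp     = c("2020", "2020", "2021", "2022")
)

# Hypothetical stand-in for geocode(): adds lat/long columns and
# returns NA for address "c" to mimic an address with no match
fake_geocode <- function(x) {
  x %>%
    mutate(lat  = if_else(address == "c", NA_real_, 40),
           long = if_else(address == "c", NA_real_, -75))
}

# Split by group, "geocode" each piece, and drop unmatched rows
geo_list <- toy %>%
  split(.$grp) %>%
  map(~ fake_geocode(.x) %>% filter(!is.na(lat)))

# Groups where nothing matched come back as 0-row tibbles
print(map_int(geo_list, nrow))

# Keep only the non-empty pieces, then recombine
out <- geo_list %>%
  keep(~ nrow(.x) > 0) %>%
  bind_rows()
print(out)
```

Here `map()` plays the role of `lapply()`, and `keep()` filters the list by a predicate before the pieces are bound together.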
akrun
    Thank you. I've added a note about an issue filed on the tidygeocoder GitHub repo. I'll try your method and update my dplyr. – Francisco Jun 14 '21 at 16:18

SOLVED:

  1. Update dplyr (thanks to akrun).
  2. Update tidygeocoder. It turns out the issue was binding numeric results to NA results, which was dealt with in a newer release that I didn't have yet.

Posting my code here because there are several useful flags in the geocode() function for debugging:

adds_by_yrm <- adds %>% split(.$yrmn)
geo_list <- lapply(adds_by_yrm, function(x) {
  geo <-  geocode(.tbl = as.data.frame(x),
                  street = address,
                  city = city,
                  county = county,
                  state = state,
                  postalcode = zip,
                  # cascade method uses all options (census, osm, etc)
                  # takes longer but may be more accurate
                  method = "cascade", 
                  cascade_order = c("census", "osm"), 
                  timeout = 500, 
                  unique_only = TRUE,
                  verbose = TRUE) %>%
    filter(!is.na(lat))
    
  return(geo)
})

out <- geo_list %>%
  purrr::keep(~ NROW(.x) > 0) %>% 
  bind_rows()
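
To see why the original error mentions `<double>` and `<logical>`, here is a minimal sketch of the type mismatch using vctrs directly. It is not tidygeocoder's internal code, only an illustration of the cast that the backtrace shows failing (`vec_cast.logical.double`).

```r
library(vctrs)

# A bare NA is logical, so a column filled only with NA gets a logical type
placeholder_type <- class(NA)

# Casting a non-integer double to logical is lossy, so vctrs refuses
cast_msg <- tryCatch(
  vec_cast(36.2, to = logical()),
  error = function(e) conditionMessage(e)
)

# The other direction is fine: a logical NA casts cleanly to a double NA
typed_na <- vec_cast(NA, to = double())

print(placeholder_type)
print(cast_msg)
print(is.double(typed_na))
```

This is why it matters that placeholder columns for unmatched results are typed as double (for example with `NA_real_`) rather than left as plain `NA`.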
Francisco