
I have a 400,000-row file of manually entered addresses that need to be geocoded. There are many different variations of the same addresses in the file, so it seems wasteful to use API calls for the same address multiple times.

To cut down on this, I'd like to reduce these five rows:

    Address
    1 Main Street, Country A, World
    1 Main St, Country A, World
    1 Maine St, Country A, World
    2 Side Street, Country A, World
    2 Side St. Country A, World

down to two:

    Address
    1 Main Street, Country A, World
    2 Side Street, Country A, World

Using the stringdist package you can group the 'word' part of the strings together, but the string-matching algorithms don't differentiate between the numbers. This means that two different house numbers on the same street get categorised as the same address.
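For example, swapping only the house number leaves the strings almost identical, so any threshold loose enough to catch spelling variants will also merge different houses (a quick illustration, not taken from the real data):

library(stringdist)

# Two different house numbers on the same street differ by a single character,
# so plain string similarity rates them as near-identical
stringsim("1 Main Street, Country A, World",
          "2 Main Street, Country A, World",
          method = "lv")
# ~0.97, comfortably above a 0.8 grouping threshold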

To work around this, I came up with two approaches. The first was to manually separate the numbers and the addresses into separate columns using regular expressions and rejoin them afterwards. The problem with this is that, with so many manually entered addresses, there seem to be hundreds of different edge cases and it gets unwieldy.
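For illustration, a minimal sketch of that first approach might look like the following, assuming the house number is a single leading run of digits (exactly the assumption the edge cases break); `addr` is just a toy data frame:

library(tidyverse)

addr <- data.frame(Address = c("1 Main Street, Country A, World",
                               "1 Main St, Country A, World",
                               "2 Side Street, Country A, World"))

addr %>%
  # Split a leading run of digits from the rest of the address
  extract(Address, into = c("Number", "Street"),
          regex = "^\\s*(\\d+)\\s*(.*)$", remove = FALSE)
# The Street column could then be grouped with stringdist within each Number
# and the two parts pasted back together afterwards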

Using this answer on grouping and this one on converting words to numbers, I have a second approach that deals with the edge cases but is computationally very expensive. Is there a better third way of doing this?

library(gsubfn)
library(english)
library(qdap)
library(stringdist)
library(tidyverse)


similarGroups <- function(x, thresh = 0.8, method = "lv"){
  grp <- integer(length(x))
  Address <- x
  x <- tolower(x)
  for(i in seq_along(Address)){
    if(!is.na(Address[i])){
      # Similarity of address i to every address
      sim <- stringdist::stringsim(x[i], x, method = method)
      # Addresses similar enough to i that have not been grouped yet
      k <- which(sim > thresh & !is.na(Address))
      grp[k] <- i
      # Mark grouped addresses as handled so they are not regrouped later
      is.na(Address) <- k
    }
  }
  grp
}

df <- data.frame(Address = c("1 Main Street, Country A, World", 
                             "1 Main St, Country A, World", 
                             "1 Maine St, Country A, World", 
                             "2 Side Street, Country A, World", 
                             "2 Side St. Country A, World"))

df1 <- df %>%
  # Converts Numbers into Letters
  mutate(Address = replace_number(Address),
         # Groups Similar Addresses Together
         Address = Address[similarGroups(Address, thresh = 0.8, method = "lv")],
         # Converts Letters back into Numbers
         Address = gsubfn("\\w+", setNames(as.list(1:1000), as.english(1:1000)), Address)
  ) %>%
  # Removes the Duplicates
  unique()
rsylatian
  • Do those 5 example rows represent all the variation you will see, or are there other things not shown? – rpolicastro Sep 11 '20 at 01:45
  • No, because it's all manually entered, there's lots of different variations. There's spelling mistakes, numbers at the start, middle and end of addresses, there's letters accompanying numbers, and there's numbers with symbols and letters. The plan was to attack it with regexes, but it got completely unwieldy. – rsylatian Sep 11 '20 at 08:07
  • Hmm, 400,000 calls with the Google geocoding API are around $2k. Might make sense to try some things. But I don't think there is a really easy solution available... Are Country A and World in your example always correct? So is it just a variation in the street name? And what about the city name, where is that included? – Steffen Moritz Sep 21 '20 at 20:14
  • No, they can be different too. They're the same for brevity's sake. – rsylatian Sep 22 '20 at 07:20
  • I would first split the addresses into number, street name, country and world. For this a list of possible country and world entries would be helpful. Do you expect wrong entries for these? Then I would run a clustering algorithm (based on string distance) "by" country and world. – Roland Sep 22 '20 at 13:10
  • Yes, there's lots of incorrect entries. The issue with splitting them is the array of edge cases being thrown up and the fact there's no obvious separators. The above example is a simplified version, as the different countries have different address formats. A quick example of a problem being thrown up is Brazil, where they put the apartment number after the street and different countries have these peculiarities. – rsylatian Sep 23 '20 at 08:31
  • As a German, I'd say the Brazilian format is obviously the correct one. ;) If you have a list of countries, you can use split rules depending on the country. I don't think you'll have success without splitting. If you just go by similarity of the whole address you will get into trouble, because in some countries there are street names that you find in almost every city or village. This can make addresses that only differ by the city name more similar than addresses that should be identical but differ by their format. – Roland Sep 24 '20 at 06:26
  • You are entering the field of (postal) address validation and there are commercial products out there tackling this problem. How do you know that "Main St" and "Maine St" are the same/two different streets? You would require ZIP codes in your data to get proper geo coding lookup results. – Michael Sebald Sep 25 '20 at 10:12
  • As a general rule, there are postal codes there, but it's not included because not all countries use them. – rsylatian Sep 26 '20 at 07:22

2 Answers


stringdist::stringsimmatrix lets you compare the pairwise similarity between strings:

library(dplyr)
library(stringr)
df <- data.frame(Address = c("1 Main Street, Country A, World", 
                             "1 Main St, Country A, World", 
                             "3 Main St, Country A, World", 
                             "2 Side Street, Country A, World", 
                             "2 Side St. PO 5678 Country A,  World"))
                             
stringdist::stringsimmatrix(df$Address)
          1         2         3         4         5
1 1.0000000 0.8709677 0.8387097 0.8387097 0.5161290
2 0.8518519 1.0000000 0.9629630 0.6666667 0.4444444
3 0.8148148 0.9629630 1.0000000 0.6666667 0.4444444
4 0.8387097 0.7096774 0.7096774 1.0000000 0.6774194
5 0.5833333 0.5833333 0.5833333 0.7222222 1.0000000

As you pointed out, in the example above rows 2 and 3 are very similar according to this criterion (96%), whereas the house numbers are different.

You could add another criterion, extracting the numbers from the strings and comparing their similarity:

# Extract numbers
nums <- df %>% rowwise %>% mutate(numlist = str_extract_all(Address,"\\(?[0-9]+\\)?"))  

# Create numbers vectors pairs
numpairs <- expand.grid(nums$numlist, nums$numlist)

# Calculate similarity
numsim <- numpairs %>% rowwise %>% mutate(dist = length(intersect(Var1,Var2)) / length(union(Var1,Var2)))

# Return similarity matrix
matrix(numsim$dist,nrow(df))

     [,1] [,2] [,3] [,4] [,5]
[1,]    1    1    0  0.0  0.0
[2,]    1    1    0  0.0  0.0
[3,]    0    0    1  0.0  0.0
[4,]    0    0    0  1.0  0.5
[5,]    0    0    0  0.5  1.0

According to this new criterion, rows 2 and 3 are clearly different.

You could combine these two criteria to decide whether addresses are similar enough, for example:

matrix(numsim$dist,nrow(df)) * stringdist::stringsimmatrix(df$Address)

          1         2 3         4         5
1 1.0000000 0.8709677 0 0.0000000 0.0000000
2 0.8518519 1.0000000 0 0.0000000 0.0000000
3 0.0000000 0.0000000 1 0.0000000 0.0000000
4 0.0000000 0.0000000 0 1.0000000 0.3387097
5 0.0000000 0.0000000 0 0.3611111 1.0000000

To deal with many hundreds of thousands of addresses, expand.grid won't work on the whole dataset, but you could split/parallelize this by country or area in order to avoid an unfeasible full Cartesian product.
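As a rough sketch of that idea (extracting the country as the text after the first comma is purely illustrative; real data would need proper parsing rules):

library(dplyr)
library(purrr)
library(stringr)

# Compare addresses only within a country bucket instead of building
# a full 400k x 400k cartesian product
df %>%
  mutate(Country = str_trim(str_extract(Address, "(?<=,)[^,]+"))) %>%
  group_split(Country) %>%
  map(~ stringdist::stringsimmatrix(.x$Address))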

Waldi

You might want to look into OpenRefine, or the refinr package for R, which is much less visual but still good. It has two functions, key_collision_merge and n_gram_merge, each of which has several parameters. If you have a dictionary of good addresses, you can pass that to key_collision_merge.
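For example, a minimal sketch of the dictionary route (the address vectors below are made up for illustration; `dict` is an argument of key_collision_merge):

library(refinr)

# Hypothetical reference list of known-good addresses
clean_addresses <- c("1 Main Street, Country A, World",
                     "2 Side Street, Country A, World")

# Case/punctuation variants of the same addresses
messy <- c("1 main street country a world",
           "2 Side Street, Country A, World",
           "2 side street,  country a,  world")

# Values whose normalised key (lowercased, punctuation stripped, tokens sorted)
# collides with a dictionary entry are replaced by the dictionary spelling
key_collision_merge(messy, dict = clean_addresses)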

It is probably also good to make note of the abbreviations you see often (St., Blvd., Rd., etc.) and replace all of those. There is surely a good table of these abbreviations somewhere, for example https://www.pb.com/docs/US/pdf/SIS/Mail-Services/USPS-Suffix-Abbreviations.pdf.

Then:

library(refinr)
library(dplyr)
library(stringr)

df <- tibble(Address = c("1 Main Street, Country A, World",
                         "1 Main St, Country A, World",
                         "1 Maine St, Country A, World",
                         "2 Side Street, Country A, World",
                         "2 Side St. Country A, World",
                         "3 Side Rd. Country A, World",
                         "3 Side Road Country B World"))

df2 <- df %>%
  # Expand common street-suffix abbreviations before merging
  mutate(address_fix = str_replace_all(Address, "St\\.|St\\,|St\\s", "Street"),
         address_fix = str_replace_all(address_fix, "Rd\\.|Rd\\,|Rd\\s", "Road")) %>%
  # Merge strings that differ only by minor n-gram variations
  mutate(address_merge = n_gram_merge(address_fix, numgram = 1))

df2$address_merge
[1] "1 Main Street Country A, World"
[2] "1 Main Street Country A, World"
[3] "1 Main Street Country A, World"
[4] "2 Side Street Country A, World"
[5] "2 Side Street Country A, World"
[6] "3 Side Road Country A, World"  
[7] "3 Side Road Country B World"   
ciakovx
  • The `postmastr` package includes a dictionary for street suffixes. It is optimized for US addresses, so I'm not sure how well it will work for the format you provided (Street, Country, World), but it may be worth looking into: https://slu-opengis.github.io/postmastr/articles/postmastr.html and https://github.com/slu-openGIS/postmastr `remotes::install_github("slu-openGIS/postmastr") View(dic_us_suffix) ` – ciakovx Sep 25 '20 at 15:21