I need to fuzzy match the zip / address in two distinct datasets and get the distance between them.

Here is an example:

name_a <- c("Aldo", "Andrea", "Alberto", "Antonio", "Angelo")
name_b <- c("Sara", "Serena", "Silvia", "Sonia", "Sissi")

zip_street_a <- c("1204 Roma Street 8", "1204 Roma Street 8", "1204 Roma Street 8", "1204 Venezia street 10", "1204 Venezia Street 110")

zip_street_b <- c("1204 Roma Street 81", "1204 Roma Street 8A", "1204 Roma Street 8B", "1204 Roma Street 8C", "1204 Venezia Street 10C")

db_a <- data.frame(name_a, zip_street_a)
db_b <- data.frame(name_b, zip_street_b)

names(db_a)[names(db_a)=='zip_street_a'] <- 'zipstreet'
names(db_b)[names(db_b)=='zip_street_b'] <- 'zipstreet'

I then used library(fuzzyjoin) in combination with library(dplyr) to create the following script:

match_data <- stringdist_left_join(db_a, db_b,
              by = "zipstreet",
              ignore_case = TRUE,
              method = "jaccard",
              max_dist = 1,
              distance_col = "dist"
) %>%
  group_by(zipstreet.x)

The script works fine, but I would like to get different distances for the following address combinations:

a) 1204 Roma Street 8 vs. 1204 Roma Street 81 --> distance = 0.0147
b) 1204 Roma Street 8 vs. 1204 Roma Street 8A --> distance = 0.0147

Now, Roma Street number 81 is very far from Roma Street 8. On the other hand, Roma Street number 8A is very close to Roma Street number 8.

So, I need to have a distance very close to 0 for 8A, and far from 0 for 81.

How can I do that?

claudia
  • 1
    But someone from "1204 Doma Street 8" would still be quite close, so I'm not sure you'll get anywhere with this approach... In the `ggmap` package you'll find a `geocode` function that gives you GPS coordinates; then you can compute actual distances – moodymudskipper Jul 12 '18 at 14:41
  • 1
    `db_a$zipstreet2 <- gsub("\\D+$","",db_a$zipstreet)` will get rid of the trailing characters that are not digits; maybe it'll work as a quick fix – moodymudskipper Jul 12 '18 at 14:45
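To see what the quick fix in the comment above does, here is a minimal base R sketch that applies the same pattern to the `zip_street_b` values from the question. The names `zipstreet2` and `same_as_8` are only illustrative, and comparing by exact equality afterwards is an assumption about how the cleaned column might be used.

```r
# Addresses from the question's second dataset.
zip_street_b <- c("1204 Roma Street 81", "1204 Roma Street 8A", "1204 Roma Street 8B",
                  "1204 Roma Street 8C", "1204 Venezia Street 10C")

# Strip any trailing run of non-digit characters (e.g. the letter in "8A").
# Addresses that already end in a digit, such as "... 81", are left unchanged.
zipstreet2 <- gsub("\\D+$", "", zip_street_b)
print(zipstreet2)

# Illustrative check: which cleaned addresses now equal "1204 Roma Street 8"?
same_as_8 <- zipstreet2 == "1204 Roma Street 8"
print(same_as_8)
```

With the suffix removed, 8A, 8B and 8C collapse onto house number 8, while 81 stays distinct. A string distance between "8" and "81" will still be small, though, so this only addresses half of the problem.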

1 Answer

The distance is based on string matching, i.e. fuzzy matching. But are you talking about the physical distance between the two addresses?

In that case you need to gather longitude and latitude data for each address.
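As a rough sketch of that idea: once each address has a latitude and longitude, a great-circle distance can be computed with the haversine formula in base R. The function name `haversine_km` and the coordinates below are made-up placeholders, not real geocoding output; obtaining real coordinates would require a geocoding service.

```r
# Haversine great-circle distance in kilometres (assumes a spherical Earth).
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  # pmin() guards against rounding pushing sqrt(a) slightly above 1.
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# Hypothetical coordinates for two addresses (placeholders only).
addr_a <- c(lat = 46.2000, lon = 6.1400)
addr_b <- c(lat = 46.2010, lon = 6.1400)

dist_km <- haversine_km(addr_a[["lat"]], addr_a[["lon"]],
                        addr_b[["lat"]], addr_b[["lon"]])
print(dist_km)
```

The function is vectorised, so it could be applied to whole columns of coordinates after a join.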

Mattias99
  • OK, I will separate the street name from the street number. But it is not so easy because, for example, from "richtistrasse 7A" I have to extract only the part 7A. I am using gsub("^[:digit:]]", "", mydata), but it takes only 7 and not 7A. Is there a solution in your opinion? – claudia Jul 23 '18 at 13:26
  • 1
    Try this: `street <- "richtistrasse 7A"; gsub("\\s*\\w*$", "", street)` – Mattias99 Jul 24 '18 at 18:10
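The `gsub` call in the last comment drops the final token and so returns the street name. A minimal sketch of splitting an address into street, house number and letter suffix might look like the following. `split_address` is a hypothetical helper, and it assumes the house number is always the last whitespace-separated token and starts with digits.

```r
# Hypothetical helper: assumes the house number is the last
# whitespace-separated token and begins with digits (e.g. "7A").
split_address <- function(x) {
  token <- sub("^.*\\s", "", x)                     # last token, e.g. "7A"
  data.frame(
    street = gsub("\\s*\\w*$", "", x),              # everything before the last token
    number = as.integer(sub("\\D.*$", "", token)),  # leading digits only
    suffix = sub("^\\d+", "", token),               # whatever follows the digits
    stringsAsFactors = FALSE
  )
}

parts <- split_address(c("richtistrasse 7A", "1204 Roma Street 8",
                         "1204 Roma Street 81", "1204 Roma Street 8A"))
print(parts)

# Numeric gap between house number 8 and the numbers 81 and 8A.
number_gap <- abs(parts$number[2] - parts$number[3:4])
print(number_gap)
```

Comparing the numeric part separately makes 81 far from 8 while 8A stays at distance 0, and the street part can still be fuzzy matched on its own.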