I have a large data set with city names. Many of the names are not consistent.
Example:
vec = c("New York", "New York City", "new York CIty", "NY", "Berlin", "BERLIn", "BERLIN", "London", "LONDEN", "Lond", "LONDON")
I want to use fuzzywuzzyR to bring them into a consistent format. The problem is that I have no master list of the original city names.
The package can detect duplicates like this:
library(fuzzywuzzyR)
init_proc = FuzzUtils$new()
PROC = init_proc$Full_process   # default preprocessing: lower-case, strip non-alphanumerics, trim
init_scor = FuzzMatcher$new()
SCOR = init_scor$WRATIO         # weighted-ratio scorer
init = FuzzExtract$new()
init$Dedupe(contains_dupes = vec, threshold = 70L, scorer = SCOR)
dict_keys(['New York City', 'NY', 'BERLIN', 'LONDEN'])
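As a side note, the keys that Dedupe prints come back as a Python dict_keys object; one way to pull them into R as candidate values could be via reticulate's built-in list() (this is an assumption about the return type on my setup, not something I found documented):

# Sketch (assumes Dedupe hands back the raw Python dict_keys object via reticulate)
dedup = init$Dedupe(contains_dupes = vec, threshold = 70L, scorer = SCOR)
candidates = unlist(reticulate::import_builtins()$list(dedup))
candidates   # should give the same keys as above: "New York City" "NY" "BERLIN" "LONDEN"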
Or I can set a "master value" like this:
master = "London"
init$Extract(string = master, sequence_strings = vec, processor = PROC, scorer = SCOR)
[[1]]
[[1]][[1]]
[1] "London"
[[1]][[2]]
[1] 100
[[2]]
[[2]][[1]]
[1] "LONDON"
[[2]][[2]]
[1] 100
[[3]]
[[3]][[1]]
[1] "Lond"
[[3]][[2]]
[1] 90
[[4]]
[[4]][[1]]
[1] "LONDEN"
[[4]][[2]]
[1] 83
[[5]]
[[5]][[1]]
[1] "NY"
[[5]][[2]]
[1] 45
My question is: how can I use this to replace all matches in the list with the same value, i.e. replace all values that match the master value with "London"? However, I don't have the master values, so I first need a list of them and can then replace the matching values; in this case the masters would be "New York", "London" and "Berlin". After the process, vec should look like this:
new_vec = c("New York", "New York", "New York", "New York", "Berlin", "Berlin", "Berlin", "London", "London", "London", "London")
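For a single master value that I already know, the replacement step itself seems doable. A rough sketch, assuming Extract returns a list of (string, score) pairs exactly as printed above (and noting that it only returns the top matches, five here), using an arbitrary cutoff of 70:

# Sketch: replace every element of vec whose WRATIO score against a known
# master value clears an arbitrary cutoff (70) with the master itself
matches = init$Extract(string = master, sequence_strings = vec,
                       processor = PROC, scorer = SCOR)
hits   = sapply(matches, function(m) m[[1]])   # matched strings
scores = sapply(matches, function(m) m[[2]])   # their scores
vec[vec %in% hits[scores >= 70]] = master

What I am missing is how to get the master values ("New York", "Berlin", "London") in the first place.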
Update
@camille came up with the idea of using world.cities from the maps package. I also found this post, which uses fuzzyjoin to deal with a similar problem.
To use this, I convert vec to a data frame.
vec = as.data.frame(vec, stringsAsFactors = F)
colnames(vec) = c("City")
Then I use the fuzzyjoin package together with world.cities from the maps package:
library(dplyr)
library(maps)
library(fuzzyjoin)

vec %>%
  stringdist_left_join(world.cities, by = c(City = "name"), distance_col = "d") %>%
  group_by(City) %>%
  top_n(1)   # selects by the last column (d) and keeps ties
The output looks like this:
# A tibble: 50 x 3
# Groups: City [5]
City name d
<chr> <chr> <dbl>
1 New York New York 0
2 NY Ae 2
3 NY Al 2
4 NY As 2
5 NY As 2
6 NY As 2
7 NY Au 2
8 NY Ba 2
9 NY Bo 2
10 NY Bo 2
# ... with 40 more rows
The problem is that I have no idea how to use the distance between `name` and `City` to change the misspelled values into the correct ones for all cities. In theory the correct value should be the closest one, but e.g. for NY this is not the case.
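What I had in mind is something along these lines, i.e. keeping only the smallest d per City (here with slice_min() from newer dplyr versions); but for NY the closest candidates are a whole set of unrelated two-letter names tied at distance 2, so taking the minimum alone does not fix it:

library(dplyr)

vec %>%
  stringdist_left_join(world.cities, by = c(City = "name"), distance_col = "d") %>%
  group_by(City) %>%
  slice_min(d, with_ties = FALSE) %>%   # keep one closest candidate per input city
  ungroup() %>%
  select(City, name, d)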