I have two data tables which contain city names. The first mydf
contains a list of cities we want to check. It consists of 18990 records. The second is a reference table i have been given which contains 353766 rows.
The structure of both tables is below (head 10)
structure(list(country = "LT", city = "VILNIUS", cleaned_city = "VILNIUS"), .Names = c("country",
"city", "cleaned_city"), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -1L))
Table myref can contain cities that don't exist in mydf
as its our reference table. Table mydf
can contain cities that don't exits in myref
as we are trying to determine what is missing from our reference table.
Table mydf
may contain cities that are spelt subtly different to what we have in myref
.
The cities exist from many countries so using for loops I implemented the following logic:
- loop through every country in
mydf
- loop through every city in
mydf
for the country in loop 1 and implement a levenschtein match between the same country in the reference table (different countries can have similiar towns so this is the reason for the first loop) - record the %match and the most similar town in the
mydf
table for manual examination after the code has run.
I have tried creating a second column in the mydf table with every possible match to a city within that country and then run the levenschtein match but i ran out of memory (the table is too large and im on 32 bit windows laptop) which is why i went back to the for loops which is taking days to run. Can anyone help. My code is below (i know for loops are probably the least optimal way to approach this so this should be a great learning experience) and if any more information is needed please let me know
# Initialize variables, my.country contains all the unique countries in mydf
my.country <- unique(mydf$country)
mydf.sample <- mydf[0, ]
myref.sample <- myref[0, ]
mydf.final <- mydf[0, ] %>%
mutate(levdist = 0,
town.match = '')
# For each country, Take each item in mydf, compare it to every record in the reference table myref
# get the best levenschtein match and score
# add the levenschtein score and the city matched to mydf
for(intcountry in 1:length(my.country))
{
# Filter the mydf Table & myRef Table to specific countries based on the intcountry iteration
mydf.sample <- mydf %>%
filter(country == my.country[intcountry])
myref.sample <- mydf %>%
filter(country == my.country[intcountry])
# Inititalize the temp vector to the size of the mydf size
vector <- character(length(mydf.sample))
# Set Up Levenschtein Distance
# For every Record in the Dataframe to be checked
for(item in 1:nrow(mydf.sample))
{
# For every Record in the Reference Table
for(k in 1:nrow(myref))
{
vector[k] = levenshteinSim(mydf.sample$City[item],myref$city[k])
}
# Get index of the highest levenschtein match
max.match.index = match(max(vector),vector)
mydf.sample$levdist[item] = max(vector)
mydf.sample$town.match[item] = myref$city[max.match.index]
vector <- character(length(mydf.sample))
}
mydf.final<- rbind(mydf.sample, mydf.final)
}