2

I have two data tables which contain city names. The first mydf contains a list of cities we want to check. It consists of 18990 records. The second is a reference table i have been given which contains 353766 rows.

The structure of both tables is below (head 10)

structure(list(country = "LT", city = "VILNIUS", cleaned_city = "VILNIUS"), .Names = c("country", 
"city", "cleaned_city"), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -1L))

Table myref can contain cities that don't exist in mydf as its our reference table. Table mydf can contain cities that don't exits in myref as we are trying to determine what is missing from our reference table. Table mydf may contain cities that are spelt subtly different to what we have in myref.

The cities exist from many countries so using for loops I implemented the following logic:

  1. loop through every country in mydf
  2. loop through every city in mydf for the country in loop 1 and implement a levenschtein match between the same country in the reference table (different countries can have similiar towns so this is the reason for the first loop)
  3. record the %match and the most similar town in the mydf table for manual examination after the code has run.

I have tried creating a second column in the mydf table with every possible match to a city within that country and then run the levenschtein match but i ran out of memory (the table is too large and im on 32 bit windows laptop) which is why i went back to the for loops which is taking days to run. Can anyone help. My code is below (i know for loops are probably the least optimal way to approach this so this should be a great learning experience) and if any more information is needed please let me know

# Initialize variables, my.country contains all the unique countries in     mydf
my.country <- unique(mydf$country)
mydf.sample <- mydf[0, ]
myref.sample <- myref[0, ]
mydf.final <- mydf[0, ] %>%
  mutate(levdist = 0,
      town.match = '')

# For each country, Take each item in mydf, compare it to every record     in the reference table myref
# get the best levenschtein match and score
# add the levenschtein score and the city matched to mydf

for(intcountry in 1:length(my.country))
  {

  # Filter the mydf Table & myRef Table to specific countries based on the intcountry iteration
  mydf.sample <- mydf %>% 
    filter(country == my.country[intcountry])

  myref.sample <- mydf %>% 
    filter(country == my.country[intcountry])

  # Inititalize the temp vector to the size of the mydf size
    vector <- character(length(mydf.sample))

# Set Up Levenschtein Distance
# For every Record in the Dataframe to be checked
for(item in 1:nrow(mydf.sample))
{
  # For every Record in the Reference Table
  for(k in 1:nrow(myref))
  {
    vector[k] = levenshteinSim(mydf.sample$City[item],myref$city[k])
  }
  # Get index of the highest levenschtein match
  max.match.index = match(max(vector),vector)
  mydf.sample$levdist[item] = max(vector)
  mydf.sample$town.match[item] = myref$city[max.match.index]
  vector <- character(length(mydf.sample))
}
mydf.final<- rbind(mydf.sample, mydf.final)
 }
Jaap
  • 81,064
  • 34
  • 182
  • 193
John Smith
  • 2,448
  • 7
  • 54
  • 78
  • You can try n-gram. Levenshtein is essential a suffix array/tree and is difficult. – Micromega Oct 05 '15 at 10:34
  • Try reading through this [post](http://stackoverflow.com/questions/2908822/speed-up-the-loop-operation-in-r) about speeding up your for loops. Also a few of your data.frames are not in your example. That makes it hard to check your code – phiver Oct 05 '15 at 12:54

1 Answers1

1

When speed and memory issues are important, then the data.table package is often a good alternative. As you didn't provide example data which illustrate the problem, I created some (see the end of this answer for the dputs of the used data).

1: First you have to transform your dataframes into datatables:

library(data.table)
setDT(mydf, key=c("country","city"))
setDT(myref, key=c("country","city"))

With the key=c("country","city") part you also create a reference for each datatable.

2: Now you can easily remove the entries in mydf which are also in the reference datset myref with:

mydf <- mydf[!myref]

which gives:

> mydf
   country   city
1:      GB LONDON
2:      LT KAUNAS
3:      LT VILNUS

As you can see, the record (row) for "VILNIUS" is removed from mydf but the record/row for "VILNUS" is not because it is not an exact match.

EDIT: I removed the 3rd and 4th option as they seem not to be working properly.


Used data:

mydf <- structure(list(country = c("LT","GB","LT","LT"), city = c("VILNIUS","LONDON","KAUNAS","VILNUS")), .Names = c("country", "city"), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -1L))
myref <- structure(list(country = c("LT","NL"), city = c("VILNIUS","AMSTERDAM")), .Names = c("country", "city"), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -1L))
Jaap
  • 81,064
  • 34
  • 182
  • 193
  • Hi @Jaap, I tried to set the key using your example setDT(mydf, key=c("country","city")). I got the error Error in setDT(mydf, key = c("country", "city")) : unused argument (key = c("country", "city")) – John Smith Oct 05 '15 at 13:57
  • @JohnSmith Strange. It works at my system. Which version of `data.table` are you using? – Jaap Oct 05 '15 at 14:02
  • hi @Jaap it says data.table_1.9.4 – John Smith Oct 05 '15 at 14:05
  • @JohnSmith That's an older version of `data.table`. You will have to update to v1.9.6 – Jaap Oct 05 '15 at 14:43
  • Hi @Jaap, sorry for the delay, proxy settings in work in order to update. Is there anyway using your second method we can specify the similarity...so towns 80% similar? Similarly is it possible to see what they have been matched to in the reference table so analysts can confirm manually? – John Smith Oct 05 '15 at 14:48
  • @JohnSmith See the edit for why I removed option 3 & 4. At this moment i'm stuck as well in finding a solution with `data.table`. – Jaap Oct 05 '15 at 18:58