
OK, so I posted a question a while back about writing an R function to speed up string matching in large text files. It opened my eyes to 'data.table', and my question was answered perfectly.

Here is the link to that thread, which includes all of the data and details:

Accelerate performance and speed of string match in R

But now I am running into another problem. Once in a while, the submitted VIN#s in the 'carFile' file differ by one or two characters from those in the 'vinDB' file, due to human error when people fill out their car info at the DMV. Is there a way to edit the

dt[J(car.vins), list(NumTimesFound=.N), by=vin.names]

line of that code (provided by @BrodieG in the link above) so that it recognizes VIN#s that differ by one or two characters?

I apologize if this is an easy fix. I am just overwhelmed by the power of the 'data.table' package in R and would love to learn as much of its utility as I can, and the knowledgeable members of this forum have been absolutely pivotal to me.

**EDIT:**

So I have been playing around with the 'lapply' and 'agrep' functions as suggested, and I must be doing something wrong:

I tried replacing this line:

dt[J(car.vins), list(NumTimesFound=.N), by=vin.names]

with this:

dt <- dt[lapply(vin.vins, function(x) agrep(x,car.vins, max.distance=2)), list(NumTimesFound=.N), vin.names, allow.cartesian=TRUE]

But got the following error:

Error in `[.data.table`(dt, lapply(vin.vins, function(x) agrep(x,car.vins,  : 
x.'vin.vins' is a character column being joined to i.'V1' which is type 'integer'. 
Character columns must join to factor or character columns.

But they are both of type 'chr'. Does anyone know why I am getting this error? And am I thinking about this the right way, i.e. am I using lapply correctly here?

Thanks!

  • This isn't going to be easy. I suggest you match what will match, then, for whatever is left, compute the [Levenshtein Distance](http://stackoverflow.com/questions/3182091/fast-levenshtein-distance-in-r) of the unmatched entries against the VINs of the cars that were not matched (note: the package linked there is not available for R 3.0.2; it is just there for reference). If you know for sure that the errors are in a particular area of the VIN string then that becomes easier, but from your description it doesn't sound that way (a sketch of this two-pass idea follows these comments). – BrodieG Feb 13 '14 at 19:34
  • Thanks @BrodieG. Is there perhaps a way to incorporate the 'agrep' function into the code? – tomathon Feb 13 '14 at 19:40
  • Yes, but the expensive operations won't be in `data.table`. You need to `lapply` `agrep` over every VIN you're trying to match. It can be done, but it's not trivial, and it will probably be slow depending on how many bad VINs you have and how many candidate VINs there are. Certainly not a minor edit, at least not that I can think of. – BrodieG Feb 13 '14 at 19:49
  • I see. Thank you very much; your insight into this is very helpful. Is there a link you recommend that can be used as a tutorial for 'data.table' operations (aside from its package description page)? – tomathon Feb 13 '14 at 19:52
  • You can look at the [data.table FAQ](http://datatable.r-forge.r-project.org/datatable-faq.pdf) as well as look through highly rated questions/answers on the `data.table` tag on SO. – BrodieG Feb 13 '14 at 20:05
  • I'm reading through it now. Thanks again. – tomathon Feb 13 '14 at 20:44
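
As a minimal sketch of the two-pass approach BrodieG describes in the first comment (an editorial illustration, not code from the thread), one could match exactly first and compute the Levenshtein distance only for the leftovers, using base R's adist(). The names 'vin.vins' and 'car.vins' are assumed to be the vectors from the linked question:

# Coerce to character in case the vectors were read in as factors
known     <- as.character(vin.vins)
submitted <- as.character(car.vins)

# Pass 1: exact matches need no fuzzy work
exact    <- submitted %in% known
leftover <- unique(submitted[!exact])

# Pass 2: generalized Levenshtein distance (utils::adist) from each
# leftover VIN to every known VIN; accept the closest within 2 edits
d    <- adist(leftover, known)
best <- apply(d, 1, which.min)
ok   <- d[cbind(seq_along(best), best)] <= 2
fuzzy <- data.frame(submitted = leftover,
                    matched   = ifelse(ok, known[best], NA))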

2 Answers


I finally got it.

The 'agrep' function has a 'value' option that needs to be changed from FALSE (the default) to TRUE:

dt <- dt[lapply(car.vins, agrep, x = vin.vins, max.distance = c(cost=2, all=2), value = TRUE)
         , .(NumTimesFound = .N)
         , by = vin.names]

Note: the max.distance components (cost, all, insertions, deletions, substitutions) let you tune the allowed edit (Levenshtein) distance in detail. 'agrep' is a fascinating function!
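
As a toy illustration of what value = TRUE changes (made-up strings, not data from the thread): agrep returns integer match positions by default, which is exactly why joining its raw output against a character column failed with the type error above; value = TRUE returns the matching strings instead:

pool <- c("ABCDEFGH", "ABCXEFGH", "ZZZZZZZZ")
agrep("ABCDEFGH", pool, max.distance = 1)                 # 1 2 (integer indices)
agrep("ABCDEFGH", pool, max.distance = 1, value = TRUE)   # "ABCDEFGH" "ABCXEFGH"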

Thanks again for all the help!

– tomathon

Thanks for the answer to your own question on partial matching. Here's the complete code that I got to work on my own machine (including the reproducible example provided by BrodieG in your linked post). I had to change lapply to sapply.

library(data.table)

# Reproducible example from the linked question: 500 car descriptions,
# each paired with a random 16-character VIN
set.seed(1)
makes <- c("Toyota", "Ford", "GM", "Chrysler")
years <- 1995:2014
cars <- paste(sample(makes, 500, rep=T), sample(years, 500, rep=T))
vins <- unlist(replicate(500, paste0(sample(LETTERS, 16), collapse="")))
vinDB <- data.frame(c(cars, vins)[order(rep(1:500, 2))])
carFile <- data.frame(c(rep("junk", 1000), sample(vins, 1000, rep=T), rep("junk", 2000))[order(rep(1:1000, 4))])

# Pull the name and VIN rows back out of the interleaved data frames
vin.names <- vinDB[seq(1, nrow(vinDB), 2), ]
vin.vins <- vinDB[seq(2, nrow(vinDB), 2), ]
car.vins <- carFile[seq(2, nrow(carFile), 4), ]

# Corrupt some of the car.vins strings with character substitutions
s <- sample(length(car.vins),100)
car.vins.err <- as.character(car.vins)
car.vins.err[s] <- gsub("A","B",car.vins.err[s])
s <- sample(length(car.vins.err),100)
car.vins.err[s] <- gsub("E","F",car.vins.err[s])
s <- sample(length(car.vins.err),100)
car.vins.err[s] <- gsub("I","J",car.vins.err[s])
car.vins.err <- as.factor(car.vins.err)

# Lookup table keyed by VIN
dt <- data.table(vin.names, vin.vins, key="vin.vins")

# Exact-match counts on the clean and the corrupted submissions
dt1 <- dt[J(car.vins), list(NumTimesFound=.N), keyby=vin.names]
dt1.err <- dt[J(car.vins.err), list(NumTimesFound=.N), keyby=vin.names]

# Fuzzy-match counts: agrep with value=TRUE returns the matching VIN
# strings, which can then be joined as a character vector
dt2 <- dt[sapply(car.vins, agrep, x=vin.vins, max.distance=c(cost=2, all=2), value=TRUE), list(NumTimesFound=.N), keyby=vin.names]
dt2.err <- dt[sapply(car.vins.err, agrep, x=vin.vins, max.distance=c(cost=2, all=2), value=TRUE), list(NumTimesFound=.N), keyby=vin.names]

# Line up exact (clean), exact (corrupted), and fuzzy (corrupted) counts side by side
dt1[dt1.err][dt2.err]
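
A couple of quick checks one might add after running this (my suggestion, not part of the original answer):

dt1.err[is.na(vin.names), NumTimesFound]   # corrupted submissions with no exact match
sum(dt2.err$NumTimesFound)                 # fuzzy matches found; this can exceed the
                                           # number of submissions if one bad VIN is
                                           # within two edits of several stored VINs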
– greg44289