
I have a performance issue I need help with. Please bear with me for the explanation:

I have a database of known car VIN numbers and years (only the first 4 lines of ~5,000 shown for brevity):

>vinDB
>ToyotaCarola 2008
 IJDINJNDJIJKNDJIMKDK0897
 NissanAltima 1998
 LJIODJJNJDJNJDNJNJDJ7765

I also have a .txt document that shows a unique DMV ID, a VIN number, and a reference number in the following way (only the first 8 lines, i.e. two records, of ~55 million shown for brevity):

>carFile
>#DMVcorrNumber33:1245638:563892:6378
 IJDINJNDJIJKNDJIMKDK0897
 +
 VIN#IDref6388546
 #DMVcorrNumber33:1245638:563892:6378
 LJIODJJNJDJNJDNJNJDJ7765
 +
 VIN#IDref2453663

What I would like to do is scan every second line (the VIN) of my 'vinDB' file against every fourth line (starting with line two) of my 'carFile' file, looking for a perfect match. If a match exists, I would like to output the name of the car and how many times it is seen in the 'carFile' file.

So basically, I need this:

    Car          Year     NumTimesFound
ToyotaCarola     2008          238
NissanAltima     1998          1755

So far I have the following code, which works on a truncated 'carFile' file, but crashes my R session when I try it with all ~55 million lines:

VinCounter <- function(carFile, vinDB) {
  i <- 1      # index of inner while loop
  j <- 1      # index of outer while loop
  m <- 2      # row in vinDB; starts at 2 because the first VIN is on line 2
  s <- 2      # row in carFile
  count <- 0

  while (j <= nrow(vinDB) / 2) {       # VIN is on every 2nd line of vinDB
    while (i <= nrow(carFile) / 4) {   # VIN is on every 4th line of carFile
      if (vinDB[m, 1] == carFile[s, 1]) {
        count <- count + 1
      }
      s <- s + 4
      i <- i + 1
    }
    print(vinDB[m - 1, 1])  # car name is on the line above the VIN
    print(count)
    count <- 0
    s <- 2
    i <- 1
    m <- m + 2
    j <- j + 1
  }
}
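(For reference, the nested loops can be collapsed into one vectorized pass. The sketch below assumes, like the loop above, that both files have been read in as single-column data frames; the function name `vin_counter` is made up for illustration.)

```r
# Vectorized base-R sketch: pull out the relevant rows once, then tally
# every known VIN in a single pass with table().
vin_counter <- function(carFile, vinDB) {
  cars <- as.character(vinDB[seq(1, nrow(vinDB), 2), 1])     # odd rows: car name + year
  vins <- as.character(vinDB[seq(2, nrow(vinDB), 2), 1])     # even rows: VINs
  seen <- as.character(carFile[seq(2, nrow(carFile), 4), 1]) # every 4th row, from row 2

  # factor() with levels = vins keeps the counts aligned with `cars`;
  # VINs not in vinDB become NA and are dropped by table().
  counts <- table(factor(seen, levels = vins))
  data.frame(Car = cars, NumTimesFound = as.integer(counts))
}
```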

So, basically, I would like to figure out how to:

1) Make the code above quicker and more efficient.

2) Store my output in a .txt or .csv file (right now it just prints to the screen).

Thanks!

tomathon

1 Answer


You can do this relatively easily with data.table:

vin.names <- vinDB[seq(1, nrow(vinDB), 2), ]    # odd rows: car name + year
vin.vins  <- vinDB[seq(2, nrow(vinDB), 2), ]    # even rows: VINs
car.vins  <- carFile[seq(2, nrow(carFile), 4), ]  # every 4th row, starting at row 2

library(data.table)
dt <- data.table(vin.names, vin.vins, key="vin.vins")
dt[J(car.vins), list(NumTimesFound=.N), by=vin.names]
#         vin.names NumTimesFound
#  1:     Ford 2014            15
#  2: Chrysler 1998            10
#  3:       GM 1998             9
#  4:     Ford 1998            11
#  5:   Toyota 2000            12
# ---                            
# 75:   Toyota 2007             7
# 76: Chrysler 1995             4
# 77:   Toyota 2010             5
# 78:   Toyota 2008             1
# 79:       GM 1997             5    

The main thing to understand is that with J(car.vins) we are creating a one-column data.table holding the VINs to match (J is just shorthand for data.table, as long as it is used inside a data.table). By using that data.table inside dt, we join the list of VINs to the list of cars, because we keyed dt by "vin.vins" in the prior step. The last argument tells data.table to group the joined set by vin.names, and the middle argument that we want the number of rows .N for each group (.N is a special data.table variable).
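As a tiny self-contained illustration of the keyed join and .N (toy data, not the OP's files; the names `CarA`/`VIN_A` etc. are made up):

```r
library(data.table)

# Two known cars, keyed by VIN.
dt <- data.table(vin.names = c("CarA 2001", "CarB 2002"),
                 vin.vins  = c("VIN_A", "VIN_B"),
                 key = "vin.vins")

# VINs as they might be pulled from the big file.
car.vins <- c("VIN_A", "VIN_B", "VIN_A", "VIN_A")

# Join on the key, then count the rows landing in each group:
# CarA appears 3 times, CarB once.
dt[J(car.vins), list(NumTimesFound = .N), by = vin.names]
```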

Also, I made some junk data to run this on. In the future, please provide data like this.

set.seed(1)
makes <- c("Toyota", "Ford", "GM", "Chrysler")
years <- 1995:2014
cars <- paste(sample(makes, 500, rep=T), sample(years, 500, rep=T))
vins <- unlist(replicate(500, paste0(sample(LETTERS, 16), collapse="")))
vinDB <- data.frame(c(cars, vins)[order(rep(1:500, 2))])               
carFile <- 
  data.frame(
    c(rep("junk", 1000), sample(vins, 1000, rep=T), rep("junk", 2000))[order(rep(1:1000, 4))]
  )  
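To cover the second part of the question (saving the result instead of printing it), assign the join result to a variable and write it out. A minimal sketch with base write.csv, using toy stand-ins for the objects built above; the output file name is hypothetical:

```r
library(data.table)

# Toy stand-ins for dt and car.vins from the steps above.
dt <- data.table(vin.names = c("CarA 2001", "CarB 2002"),
                 vin.vins  = c("VIN_A", "VIN_B"),
                 key = "vin.vins")
car.vins <- c("VIN_A", "VIN_B", "VIN_A")

# Capture the grouped counts, then write them to disk.
res <- dt[J(car.vins), list(NumTimesFound = .N), by = vin.names]
write.csv(res, "vin_counts.csv", row.names = FALSE)  # hypothetical file name
```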
BrodieG