2

I'm combining similar names using stringdist(), and have it working using lapply, but it's taking 11 hours to run through 500k rows and I'd like to see if a data.table solution would work faster. Here's an example and my attempted solution so far built from readings here, here, here, here, and here, but I'm not quite pulling it off:

library(stringdist)
library(data.table)
data("mtcars")
mtcars$cartype <- rownames(mtcars)
mtcars$id <- seq_len(nrow(mtcars))

I'm currently using lapply() to cycle through the strings in the cartype column and bring together those rows whose cartype names are closer than a specified value (.08).

output <- lapply(1:length(mtcars$cartype), function(x) mtcars[which(stringdist(mtcars$cartype[x], mtcars$cartype, method ="jw", p=0.08)<.08), ])

> output[1:3]
[[1]]
              mpg cyl disp  hp drat    wt  qsec vs am gear carb       cartype id
Mazda RX4      21   6  160 110  3.9 2.620 16.46  0  1    4    4     Mazda RX4  1
Mazda RX4 Wag  21   6  160 110  3.9 2.875 17.02  0  1    4    4 Mazda RX4 Wag  2

[[2]]
              mpg cyl disp  hp drat    wt  qsec vs am gear carb       cartype id
Mazda RX4      21   6  160 110  3.9 2.620 16.46  0  1    4    4     Mazda RX4  1
Mazda RX4 Wag  21   6  160 110  3.9 2.875 17.02  0  1    4    4 Mazda RX4 Wag  2

[[3]]
            mpg cyl disp hp drat   wt  qsec vs am gear carb    cartype id
Datsun 710 22.8   4  108 93 3.85 2.32 18.61  1  1    4    1 Datsun 710  3

Data Table Attempt:

mtcarsdt <- as.data.table(mtcars)    
myfun <- function(x) mtcars[which(stringdist(mtcars$cartype[x], mtcars$cartype, method ="jw", p=0.08)<.08), ]

An intermediate step: This code pulls similar names based on the row's value that I manually plug into myfun(), but it repeats that value for all the rows.

res <- mtcarsdt[,.(vlist = list(myfun(1))),by=id]
res$vlist[[1]] #correctly combines the 2 mazda names
res$vlist[[6]] #but it's repeated down the line

I'm now trying to cycle through all the rows using set(). I'm close, but although the code appears to be correctly matching the text from the 12th column (cartype) it's returning the values from the first column, mpg:

for (i in 1:32) set(mtcarsdt,i ,12L, myfun(i))
> mtcarsdt
     mpg cyl  disp  hp drat    wt  qsec vs am gear carb                   cartype id
 1: 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4                 c(21, 21)  1
 2: 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4                 c(21, 21)  2
 3: 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1                      22.8  3

Now, this is pretty hacky, but I found that if I create a copy of the cartype column and place it in the first column it pretty much works, but there must be a cleaner way to do this. Also, it would be nice to keep the output in a list form like the lapply() output above as I have other post-processing steps set up for that format.

mtcars$cartypeorig <- mtcars$cartype
mtcars <- mtcars[,c(14,1:13)]
mtcarsdt <- as.data.table(mtcars)
for (i in 1:32) set(mtcarsdt,i ,13L, myfun(i))

 > mtcarsdt[1:14,cartype]
 [1] "c(\"Mazda RX4\", \"Mazda RX4 Wag\")"                        
 [2] "c(\"Mazda RX4\", \"Mazda RX4 Wag\")"                        
 [3] "Datsun 710"                                                 
 [4] "Hornet 4 Drive"                                             
 [5] "Hornet Sportabout"                                          
 [6] "Valiant"                                                    
 [7] "Duster 360"                                                 
 [8] "c(\"Merc 240D\", \"Merc 230\", \"Merc 280\")"               
 [9] "c(\"Merc 240D\", \"Merc 230\", \"Merc 280\", \"Merc 280C\")"
[10] "c(\"Merc 240D\", \"Merc 230\", \"Merc 280\", \"Merc 280C\")"
[11] "c(\"Merc 230\", \"Merc 280\", \"Merc 280C\")"               
[12] "c(\"Merc 450SE\", \"Merc 450SL\", \"Merc 450SLC\")"         
[13] "c(\"Merc 450SE\", \"Merc 450SL\", \"Merc 450SLC\")"         
[14] "c(\"Merc 450SE\", \"Merc 450SL\", \"Merc 450SLC\")"         
Community
  • 1
  • 1
Luke Macaulay
  • 393
  • 5
  • 14

1 Answers1

0

Have you tried using the matrix version of stringdist?

res = stringdistmatrix(mtcars$cartype, mtcars$cartype, method = 'jw', p = 0.08)

out = as.data.table(which(res < 0.08, arr.ind = T))[, .(list(mtcars[row,])), by = col]$V1

identical(out, output)
#[1] TRUE

Now, you probably can't just run the above for a 500k X 500k matrix, but you can split it into smaller pieces (pick size appropriate for your data/memory sizes):

size = 4 # dividing into pieces of size 4x4
         # I picked a divisible number, a little more work will be needed
         # if you have a residue (nrow(mtcars) = 32)
setDT(mtcars)

grid = CJ(seq_len(nrow(mtcars)/4), seq_len(nrow(mtcars)/4))

indices = grid[, {
            res = stringdistmatrix(mtcars[seq((V1-1)*size+1, (V1-1)*size + size), cartype],
                                   mtcars[seq((V2-1)*size+1, (V2-1)*size + size), cartype],
                                   method = 'jw', p = 0.08)
            out = as.data.table(which(res < 0.08, arr.ind = T))
            if (nrow(out) > 0)
              out[, .(row = (V1-1)*size+row, col = (V2-1)*size +col)]
          }, by = .(V1, V2)]

identical(indices[, .(list(mtcars[row])), by = col]$V1, lapply(output, setDT))
#[1] TRUE
eddi
  • 49,088
  • 6
  • 104
  • 155
  • I'd like to avoid the distance matrix approach (memory limit) and splitting the dataset. Splitting it would work within each matrix, but then identifying matches across multiple matrices leads to additional challenges. For example imagine 2 names were matched in one matrix and 2 very similar names were matched in another. Getting those 4 similar names together in the final data set would be challenging. Also sometimes one name will match 3 others, but one of the others won't match back to the original, something I can deal with in original approach, but would be harder with multiple matrices. – Luke Macaulay May 20 '16 at 15:49