
I wonder if someone can figure out a faster way to calculate combinations of elements in a vector. My approach works, but it is slow with about 6 million elements in the vector.

Test vector

test.vector <- c("335261 344015 537633","22404 132858","254654 355860 488288","219943 373817","331839 404477")

My approach

lapply(strsplit(test.vector, " "), function(x) unique(apply(combn(x, 2), 2, function(y) paste0(y, collapse = ""))))

Expected output

[[1]]
[1] "335261344015" "335261537633" "344015537633"

[[2]]
[1] "22404132858"

[[3]]
[1] "254654355860" "254654488288" "355860488288"

[[4]]
[1] "219943373817"

[[5]]
[1] "331839404477"
CER
  • first quick thing is to add fixed = TRUE to strsplit – MichaelChirico Jun 07 '18 at 22:58
  • also, combn is written in R and hence slow. see here https://stackoverflow.com/q/26828301/3576984 – MichaelChirico Jun 07 '18 at 23:00
  • wow thanks that should already boost the whole thing! – CER Jun 07 '18 at 23:01
  • You can also use `parLapply()` to improve performance (not by a lot, depending on how many clusters you can use). – VFreguglia Jun 07 '18 at 23:01
  • `Map(combn, strsplit(test.vector, " "), 2, c(paste0), collapse = "")`? Or even `lapply(strsplit(test.vector, " "), combn, 2, paste0, collapse = "")`? You are using `apply` inside the `lapply` function; this slows things down. – Onyambu Jun 07 '18 at 23:04
  • To include the uniqueness, you can do `lapply(strsplit(test.vector, " "), function(x) combn(unique(x), 2, paste0, collapse = ""))` – Onyambu Jun 07 '18 at 23:12
  • also, I highly recommend RStudio's built-in code profiler, which will quickly show you your code's bottlenecks – MichaelChirico Jun 07 '18 at 23:44
  • Is there a limit on our numbers? For example, can the individual numbers before they are split be greater than say 10^7? Are you going to use the results computationally? – Joseph Wood Jun 08 '18 at 00:32
  • @JosephWood yes, they go up to about 600000 and I use the results for subsequent calculations – CER Jun 08 '18 at 01:11
  • @Onyambu for my whole dataset I run into `Error in combn(unique(x), 2, paste0, collapse = "") : n < m` with `lapply(strsplit(test.vector, " "), function(x) combn(unique(x), 2, paste0, collapse = ""))` – CER Jun 08 '18 at 01:16
  • That is because you end up with a vector that has no data or just one number... If you look at your function, unique was applied AFTER the combinations had been generated... so you can try this instead: `lapply(strsplit(test.vector, " "), function(x) unique(combn(x, 2, paste0, collapse = "")))` – Onyambu Jun 08 '18 at 01:19
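Putting the comment suggestions together, a sketch: `fixed = TRUE` skips the regex engine in `strsplit`, and passing `paste0` directly as `combn`'s `FUN` argument avoids the extra `apply()` pass, with `unique` applied to the results so that duplicated pairs are dropped without erroring on short elements:

```r
test.vector <- c("335261 344015 537633", "22404 132858",
                 "254654 355860 488288", "219943 373817", "331839 404477")

## fixed = TRUE avoids regex overhead; combn(x, 2, FUN, ...) applies
## paste0 to each pair directly, so no separate apply() is needed
lapply(strsplit(test.vector, " ", fixed = TRUE),
       function(x) unique(combn(x, 2, paste0, collapse = "")))
```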

1 Answer


Here is an answer that is over 25x faster than the OP's solution on large test cases. It doesn't rely on paste; instead, we take advantage of properties of numbers and vectorized operations. We also use comboGeneral from the RcppAlgos package (I am the author), which is much faster than combn and the combnPrim from the linked answer for generating combinations of a vector. First, we show the efficiency gains of comboGeneral over the other functions:

## library(gRbase)
library(RcppAlgos)
library(microbenchmark)

microbenchmark(gRbase::combnPrim(300, 2), combn(300, 2), 
               comboGeneral(300, 2), unit = "relative")
Unit: relative
                     expr        min         lq      mean     median         uq       max neval
gRbase::combnPrim(300, 2)   5.145654   5.192439   4.83561   7.167839   4.320497   3.98992   100
            combn(300, 2) 204.866624 192.559119 143.75540 174.079339 102.733367 539.12325   100
     comboGeneral(300, 2)   1.000000   1.000000   1.00000   1.000000   1.000000   1.00000   100

Now, we create a function to create some random reproducible data that will be passed to our test functions:

makeTestSet <- function(vectorSize, elementSize, mySeed = 42, withRep = FALSE) {
    set.seed(mySeed)
    sapply(1:vectorSize, function(x) {
        paste(sample(10^6, sample(2:elementSize, 1), replace = withRep), collapse = " ")
    })
}

makeTestSet(5, 3)
[1] "937076 286140 830446" "519096 736588 134667" "705065 457742 719111" 
[4] "255429 462293 940013" "117488 474997 560332"

That looks good. Now, let's see if setting fixed = TRUE gets us any gains (as suggested above by @MichaelChirico):

bigVec <- makeTestSet(10, 100000)

microbenchmark(standard = strsplit(bigVec, " "), 
               withFixed = strsplit(bigVec, " ", fixed = TRUE), 
               times = 15, unit = "relative")
Unit: relative
     expr      min       lq     mean   median       uq      max neval
 standard 4.447413 4.296662 4.133797 4.339537 4.084019 3.415639    15
withFixed 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    15

@MichaelChirico was spot on. Putting it all together we get:

combPairFast <- function(testVec) {
    lapply(strsplit(testVec, " ", fixed = TRUE), function(x) {
        combs <- RcppAlgos::comboGeneral(as.numeric(x), 2)
        ## shift the first number left by the number of digits in the
        ## second, then add: arithmetic concatenation of the two numbers
        unique(combs[, 1] * 10^(as.integer(log10(combs[, 2])) + 1L) + combs[, 2])
    })
}

## test.vector defined above by OP
combPairFast(test.vector)
[[1]]
[1] 335261344015 335261537633 344015537633

[[2]]
[1] 22404132858

[[3]]
[1] 254654355860 254654488288 355860488288

[[4]]
[1] 219943373817

[[5]]
[1] 331839404477

## OP original code
combPairOP <- function(testVec) {
    lapply(strsplit(testVec, " "), function(x) unique(apply(combn(x, 2), 2, function(y) paste0(y, collapse = ""))))
}

As the OP stated in the comments, the maximum number is less than a million (600000 to be exact). That means that after we multiply one of the numbers by at most 10^6 and add it to another 6-digit number (equivalent to simply concatenating two strings of digits), we are guaranteed to stay within the exact-integer range of base R doubles (i.e. 2^53 - 1). This is good, because arithmetic operations on numbers are much more efficient than string operations.
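To see the digit-shifting trick on a concrete pair from the test vector, we can check that the arithmetic result matches plain string concatenation:

```r
a <- 335261
b <- 344015
shift <- 10^(as.integer(log10(b)) + 1L)  # 10^6, since b has 6 digits
a * shift + b == as.numeric(paste0(a, b))
## [1] TRUE
```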

All that is left is to benchmark:

test.vector <- makeTestSet(100, 50)

microbenchmark(combPairOP(test.vector), 
               combPairFast(test.vector),
               times = 20, unit = "relative")
Unit: relative
                     expr      min      lq     mean   median     uq      max neval
  combPairOP(test.vector) 22.33991 22.4264 21.67291 22.11017 21.729 25.23342    20
combPairFast(test.vector)  1.00000  1.0000  1.00000  1.00000  1.000  1.00000    20

And on larger vectors:

bigTest.vector <- makeTestSet(1000, 100, mySeed = 22, withRep = TRUE)

## Duplicate values exist
any(sapply(strsplit(bigTest.vector, " ", fixed = TRUE), function(x) {
    any(duplicated(x))
}))
[1] TRUE

system.time(t1 <- combPairFast(bigTest.vector))
 user  system elapsed 
0.303   0.011   0.314 

system.time(t2 <- combPairOP(bigTest.vector))
 user  system elapsed 
8.820   0.081   8.902    ### 8.902 / 0.314 ~= 28x faster

## results are the same
all.equal(t1, lapply(t2, as.numeric))
[1] TRUE
Joseph Wood
  • Yours runs super fast, but I can't get all.equal to return TRUE between your outcome and mine using my old code `lapply(strsplit(df$full.number, " ", fixed = TRUE), function(x) unique(apply(combn(x, 2), 2, function(y) paste0(y, collapse = ""))))` – CER Jun 08 '18 at 02:39
  • Are you sorting and converting to numeric as I did in the example above? – Joseph Wood Jun 08 '18 at 02:41
  • yes, just guessing but is the unique missing maybe? – CER Jun 08 '18 at 02:41
  • @CER, That is exactly what I was about to write... If you do unique (as.numeric(x)) it should work. I will update my answer when I get a chance, but the idea still holds. That is, arithmetic is faster than paste. – Joseph Wood Jun 08 '18 at 02:43
  • Maybe I'm just doing something wrong, but even just using your `combPair()` and `combPairfast()` with your `bigTest.vector` and then running `all.equal()` does not give me TRUE. `set.seed(22)` – CER Jun 08 '18 at 03:14
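Following up on the comment thread, here is a hypothetical sketch of the duplicate-aware revision suggested above (the name `combPairFastUnique` is my own): deduplicate each element's numbers with `unique(as.numeric(x))` before generating pairs. It assumes every element still contains at least two distinct numbers; otherwise `comboGeneral` with m = 2 would fail, just like `combn` did in the question comments.

```r
library(RcppAlgos)

## Hypothetical revision per the comments: dedupe the input numbers
## first so pairs of identical values never arise; unique() on the
## result still guards against duplicate concatenated pairs
combPairFastUnique <- function(testVec) {
    lapply(strsplit(testVec, " ", fixed = TRUE), function(x) {
        nums <- unique(as.numeric(x))   # assumes length(nums) >= 2
        combs <- RcppAlgos::comboGeneral(nums, 2)
        unique(combs[, 1] * 10^(as.integer(log10(combs[, 2])) + 1L) + combs[, 2])
    })
}
```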
  • @CER, I just saw your comments.... I reran with `set.seed(22)` and when I run `all.equal(lapply(t1, sort), lapply(t2, function(x) sort(as.numeric(x))))`, it returns `TRUE`. I'm not sure what you are doing, but there is no way to know unless you post a specific example where the solutions differ. I still haven't forgotten about updating to account for non-unique values (doing it right now). – Joseph Wood Jun 08 '18 at 11:02