Here is an answer that is over 25x faster than the OP's solution on large test cases. It doesn't rely on paste; instead, we take advantage of properties of numbers and vectorized operations. We also use comboGeneral from the RcppAlgos package (I am the author), which is much faster than combn and combnPrim from the linked answer for generating combinations of a vector. First, we show the efficiency gains of comboGeneral over the other functions:
## library(gRbase)
library(RcppAlgos)
library(microbenchmark)
microbenchmark(gRbase::combnPrim(300, 2), combn(300, 2),
               comboGeneral(300, 2), unit = "relative")
Unit: relative
                      expr        min         lq      mean     median         uq       max neval
 gRbase::combnPrim(300, 2)   5.145654   5.192439   4.83561   7.167839   4.320497   3.98992   100
             combn(300, 2) 204.866624 192.559119 143.75540 174.079339 102.733367 539.12325   100
      comboGeneral(300, 2)   1.000000   1.000000   1.00000   1.000000   1.000000   1.00000   100
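One structural note: comboGeneral returns one combination per row, while combn returns one combination per column, which is why the pair members are pulled out further down with combs[, 1] and combs[, 2]. A minimal sketch (assuming RcppAlgos is loaded as above; the tiny inputs are only for illustration):
head(comboGeneral(5, 2), 3)  ## each row is one pair drawn from 1:5
combn(5, 2)[, 1:3]           ## the same three pairs, one per column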
Now, we write a function that generates random, reproducible data to pass to our test functions:
makeTestSet <- function(vectorSize, elementSize, mySeed = 42, withRep = FALSE) {
    set.seed(mySeed)
    sapply(1:vectorSize, function(x) {
        ## each element is a space separated string of 2 to elementSize numbers
        paste(sample(10^6, sample(2:elementSize, 1), replace = withRep), collapse = " ")
    })
}
makeTestSet(5, 3)
[1] "937076 286140 830446" "519096 736588 134667" "705065 457742 719111"
[4] "255429 462293 940013" "117488 474997 560332"
That looks good. Now, let's see if setting fixed = TRUE gets us any gains (as suggested above by @MichaelChirico):
bigVec <- makeTestSet(10, 100000)
microbenchmark(standard = strsplit(bigVec, " "),
               withFixed = strsplit(bigVec, " ", fixed = TRUE),
               times = 15, unit = "relative")
Unit: relative
      expr      min       lq     mean   median       uq      max neval
  standard 4.447413 4.296662 4.133797 4.339537 4.084019 3.415639    15
 withFixed 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    15
@MichaelChirico was spot on. Putting it all together, we get:
combPairFast <- function(testVec) {
    lapply(strsplit(testVec, " ", fixed = TRUE), function(x) {
        combs <- RcppAlgos::comboGeneral(as.numeric(x), 2)

        ## shift the first member left by the number of digits in the second
        ## member, then add: the numeric equivalent of paste0-ing the pair
        unique(combs[, 1] * (10)^(as.integer(log10(combs[, 2])) + 1L) + combs[, 2])
    })
}
## test.vector defined above by OP
combPairFast(test.vector)
[[1]]
[1] 335261344015 335261537633 344015537633

[[2]]
[1] 22404132858

[[3]]
[1] 254654355860 254654488288 355860488288

[[4]]
[1] 219943373817

[[5]]
[1] 331839404477
## OP original code
combPairOP <- function(testVec) {
    lapply(strsplit(testVec, " "), function(x) {
        unique(apply(combn(x, 2), 2, function(y) paste0(y, collapse = "")))
    })
}
As stated in the comments by the OP, the maximum number is less than a million (600000 to be exact). That means when we multiply one of the numbers by at most 10^6 and add it to another six digit number (equivalent to simply concatenating the two strings of digits), the largest possible result is around 6 * 10^11, which is well within the numerical precision of base R (i.e. 2^53 - 1, roughly 9 * 10^15). This is good because arithmetic on numbers is much more efficient than operations on strings.
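To make the digit arithmetic concrete, here is the first pair from the output above worked through by hand (an illustrative sketch of the same formula used inside combPairFast):
x <- 335261; y <- 344015
nDigits <- as.integer(log10(y)) + 1L  ## y has 6 digits
x * 10^nDigits + y                    ## 335261344015
as.numeric(paste0(x, y))              ## same value via string concatenation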
All that is left is to benchmark:
test.vector <- makeTestSet(100, 50)
microbenchmark(combPairOP(test.vector),
               combPairFast(test.vector),
               times = 20, unit = "relative")
Unit: relative
                      expr      min      lq     mean   median     uq      max neval
   combPairOP(test.vector) 22.33991 22.4264 21.67291 22.11017 21.729 25.23342    20
 combPairFast(test.vector)  1.00000  1.0000  1.00000  1.00000  1.000  1.00000    20
And on larger vectors:
bigTest.vector <- makeTestSet(1000, 100, mySeed = 22, withRep = TRUE)
## Duplicate values exist
any(sapply(strsplit(bigTest.vector, " ", fixed = TRUE), function(x) {
    any(duplicated(x))
}))
[1] TRUE
system.time(t1 <- combPairFast(bigTest.vector))
   user  system elapsed
  0.303   0.011   0.314
system.time(t2 <- combPairOP(bigTest.vector))
   user  system elapsed
  8.820   0.081   8.902   ### 8.902 / 0.314 ~= 28x faster
## results are the same
all.equal(t1, lapply(t2, as.numeric))
[1] TRUE