I faced somewhat similar problem in this question.
I wrapped your solution into function for better comparison:
goreF <- function(x,y,proportion){
temp <- setkey(setkey(x[, c(k = 1, .SD)], k)[y[,c(k = 1, .SD)],
allow.cartesian = TRUE][, k := NULL],
a, c)
temp <- temp[setkey(proportion, a, c)][, prop := prop / .N, by = .(a, c)]
chosen_pairs <- temp[, .SD[sample(.N, 5, replace = FALSE, prob = prop)],
by = a]
chosen_pairs
}
My approach:
myFunction <- function(x, y, proportion){
temp <- setkey(setkey(x[, c(k = 1, .SD)], k)[y[,c(k = 1, .SD)],
allow.cartesian = TRUE][, k := NULL],
a, c)
temp <- temp[setkey(proportion, a, c)][, prop := prop / .N, by = .(a, c)]
chosen_pairs <- temp[, sample(.I, 5, replace = FALSE, prob = prop), by = a]
indexes <- chosen_pairs[[2]]
temp[indexes]
}
require(rbenchmark)
benchmark(myFunction(x, y, proportion), goreF(x, y, proportion),
replications = 1,
columns = c("test", "replications", "elapsed", "relative",
"user.self", "sys.self"))
test replications elapsed relative user.self sys.self
2 goreF(x, y, proportion) 1 19.83 21.323 19.35 0.13
1 myFunction(x, y, proportion) 1 0.93 1.000 0.86 0.08
Perhaps there can be found more improvements, I will update, if found any. First two operations seems too complicated, maybe they can be shortened, but, as I did not see that they impact calculation timings, I did not rewrite them.
Update:
As pointed out in question I mentioned in the beginning, you could get into trouble with myFunction
, if your groups would contain only one element. So i modified it, based on comments from that post.
myFunction2 <- function(x, y, proportion){
temp <- setkey(setkey(x[, c(k = 1, .SD)], k)[y[,c(k = 1, .SD)],
allow.cartesian = TRUE][, k := NULL],
a, c)
temp <- temp[setkey(proportion, a, c)][, prop := prop / .N, by = .(a, c)]
indexes <- temp[, .I[sample(.N, 5, replace = T, prob = prop)], by = a]
indexes <- indexes[[2]]
temp[indexes]
}
benchmark(myFunction(x, y, proportion), myFunction2(x, y, proportion),
replications = 5,
columns = c("test", "replications", "elapsed", "relative",
"user.self", "sys.self"))
test replications elapsed relative user.self sys.self
1 myFunction(x, y, proportion) 5 6.61 1.064 6.23 0.36
2 myFunction2(x, y, proportion) 5 6.21 1.000 5.71 0.26
We can see marginal speed improvement.