
I have a few large data sets that I'm trying to combine, and I have created a toy example of what I want to do. I have three tables:

require(data.table)
set.seed(151)
x <- data.table(a=1:100000)                                      # ids to sample for
y <- data.table(b=letters[1:20],c=sample(LETTERS[1:4]))          # items 'b', each assigned a group 'c' (recycled)
proportion <- data.table(expand.grid(a=1:100000,c=LETTERS[1:4])) # one weight per (a,c) pair
proportion[,prop:=rgamma(4,shape = 1),by=a]                      # draw random weights
proportion[,prop:=prop/sum(prop),by=a]                           # normalize weights within each 'a'

The three tables are x, y, and proportion. For each element of x I want to sample five rows from the entire table y, using the probabilities from the table proportion, and combine the results into another table. The method that I came up with is:

# Cross join x and y via a dummy key 'k', then key the result by (a,c)
temp <- setkey(setkey(x[,c(k=1,.SD)],k)[y[,c(k=1,.SD)],allow.cartesian=TRUE][,k:=NULL],a,c)
# Attach the weights and split each (a,c) weight evenly across its b's
temp <- temp[setkey(proportion,a,c)][,prop:=prop/.N,by=.(a,c)] # uniform within the same 'c' column group
# Sample five rows per 'a' using the normalized weights
chosen_pairs <- temp[,.SD[sample(.N,5,replace=FALSE,prob = prop)],by=a]

But this method is memory-intensive and slow, as it cross-joins the two tables first and then samples from the result. Is there a way to perform this task efficiently in both memory and time?

A Gore
  • Why are you recalculating probabilities in the second line of your solution? – minem Jun 02 '17 at 07:56
  • @MārtiņšMiglinieks I am normalizing the probabilities as for a given `(a,c)` pair there could be many `b` – A Gore Jun 02 '17 at 14:02
  • Look up `.EACHI`: when cross-joining, this lets you deal with just the joined portion in each operation, without resolving the full join in memory (sketched below). – Shape Jun 04 '17 at 17:37
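For reference, a minimal sketch of the `.EACHI` idea from the last comment (an illustration only, not a full solution to the question): with `by = .EACHI`, `j` is evaluated once per row of the joining table, so per-group results can be computed without materializing the whole join in memory.

# Illustration of by = .EACHI: for each (a,c) row of 'proportion',
# count the matching rows of 'y' without building the full cross join.
cnt <- y[proportion, on = "c", .(a = i.a, n_b = .N), by = .EACHI]
head(cnt)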

1 Answer


I faced a somewhat similar problem in this question. I wrapped your solution into a function for better comparison:

goreF <- function(x,y,proportion){
  temp <- setkey(setkey(x[, c(k = 1, .SD)], k)[y[,c(k = 1, .SD)],
                                    allow.cartesian = TRUE][, k := NULL],
           a, c)
  temp <- temp[setkey(proportion, a, c)][, prop := prop / .N, by = .(a, c)]
  chosen_pairs <- temp[, .SD[sample(.N, 5, replace = FALSE, prob = prop)],
                   by = a]
  chosen_pairs
}

My approach:

myFunction <- function(x, y, proportion){
  temp <- setkey(setkey(x[, c(k = 1, .SD)], k)[y[,c(k = 1, .SD)],
                                           allow.cartesian = TRUE][, k := NULL],
             a, c)
  temp <- temp[setkey(proportion, a, c)][, prop := prop / .N, by = .(a, c)]
  chosen_pairs <- temp[, sample(.I, 5, replace = FALSE, prob = prop), by = a]
  indexes <- chosen_pairs[[2]]
  temp[indexes]
}
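The speed difference in the benchmark below comes from the sampling step: `sample(.I, ...)` collects global row numbers, so `temp` is subset once at the end, whereas `.SD[sample(.N, ...)]` materializes a copy of each group's rows. A toy illustration of the same pattern (not part of the original comparison):

# Collect global row numbers per group, then subset the table once,
# instead of building one .SD copy per group.
dt <- data.table(g = rep(1:2, each = 4), v = 1:8)
idx <- dt[, sample(.I, 2), by = g][[2]]  # .I gives row numbers into dt
dt[idx]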

require(rbenchmark)
benchmark(myFunction(x, y, proportion), goreF(x, y, proportion),
      replications = 1,
      columns = c("test", "replications", "elapsed", "relative",
                  "user.self", "sys.self"))
                          test replications elapsed relative user.self sys.self
2      goreF(x, y, proportion)            1   19.83   21.323     19.35     0.13
1 myFunction(x, y, proportion)            1    0.93    1.000      0.86     0.08

Perhaps more improvements can be found; I will update if I find any. The first two operations seem too complicated and could perhaps be shortened, but since I did not see them impact the calculation timings, I did not rewrite them. One possible shortening is sketched below.
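For instance, since `proportion` in the toy data already contains every value of `a` that appears in `x`, the dummy-key cross join with `x` may be redundant. A sketch under that assumption (not benchmarked here):

# Build 'temp' directly: joining y into proportion on 'c' yields the
# same (a, b, c, prop) rows as the dummy-key cross join, provided every
# 'a' of x is present in proportion.
temp2 <- proportion[y, on = "c", allow.cartesian = TRUE]
temp2[, prop := prop / .N, by = .(a, c)]
setkey(temp2, a, c)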

Update:

As pointed out in the question I mentioned at the beginning, you could get into trouble with myFunction if a group contained only one element. So I modified it, based on comments from that post.
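The pitfall is base R's `sample()`: given a single number `n`, it samples from `1:n`, so a length-one `.I` is silently reinterpreted. A minimal illustration:

# Base R quirk: a length-one vector is treated as an upper bound,
# not as the set to sample from.
sample(c(7, 8), 1)  # picks 7 or 8, as expected
sample(7, 1)        # picks from 1:7, NOT the value 7

The modified function therefore uses `.I[sample(.N, ...)]`, which always indexes within the group: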

myFunction2 <- function(x, y, proportion){
  temp <- setkey(setkey(x[, c(k = 1, .SD)], k)[y[,c(k = 1, .SD)],
                                               allow.cartesian = TRUE][, k := NULL],
                 a, c)
  temp <- temp[setkey(proportion, a, c)][, prop := prop / .N, by = .(a, c)]
  indexes <- temp[, .I[sample(.N, 5, replace = TRUE, prob = prop)], by = a]
  indexes <- indexes[[2]]
  temp[indexes]
}

benchmark(myFunction(x, y, proportion), myFunction2(x, y, proportion),
          replications = 5,
          columns = c("test", "replications", "elapsed", "relative",
                      "user.self", "sys.self"))

                           test replications elapsed relative user.self sys.self
1  myFunction(x, y, proportion)            5    6.61    1.064      6.23     0.36
2 myFunction2(x, y, proportion)            5    6.21    1.000      5.71     0.26

We can see a marginal speed improvement.

minem
  • I was doing something similar; I should have updated the question. The step that I had been using is `chosen_pairs <- temp[temp[,.(b=sample(b,5,replace = FALSE,prob = prop)),by=a],on=c("a","b")]` (spelled out after these comments), which runs comparably to your function. I'll accept your answer, but I'd appreciate help with saving memory, because that's another issue I'm tackling now. – A Gore Jun 02 '17 at 14:03
  • @AGore Do you run into memory problems in exactly this operation? How big are your data and your RAM? I do not run into any. Maybe some data reduction can be done beforehand. – minem Jun 02 '17 at 19:10
  • The datasets `x` and `y` are approximately 5 GB and 2 GB respectively. The cross join produces output that is approximately 12 GB, and the final output (as a result of sampling) is approximately 6 GB. I have 16 GB of RAM. I cannot figure out where I can do data reduction. – A Gore Jun 05 '17 at 12:43
  • @AGore Can you give some more details about the structure, size, and variable types of `x` and `y` (`str` output, maybe)? Then I (or someone else) could think about a solution... – minem Jun 05 '17 at 13:31
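For reference, the asker's alternative from the first comment above, written out as two steps (it assumes a `temp` table built as in the functions above):

# Sample 5 b's per 'a' with the normalized weights, then join back to
# 'temp' on (a, b) to recover the 'c' and 'prop' columns.
sampled <- temp[, .(b = sample(b, 5, replace = FALSE, prob = prop)), by = a]
chosen_pairs <- temp[sampled, on = c("a", "b")]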