
I'm trying to create a subsample of a big dataset with a different n for each factor level. I need this to be very fast, because I'm doing it hundreds of thousands of times. Could you help me optimise this process using data.table?

What I do now is add a "rownumber" rn to the data.table, sample that using the various n's, and then do the subset based on that. I think there must be a smarter way of doing this, but I cannot seem to figure it out.

# generate data.table
library(data.table)
DT <- data.table(rn = 1:100,
  factor = rep(letters[1:3], length.out = 100),
  value = rnorm(100, c(1, 5, 10)))

# subset based on "row number" with various numbers per category
subsetrn <- DT[, .(rn = sample(rn,
  if (factor[1] == "a") 12
  else if (factor[1] == "b") 20
  else if (factor[1] == "c") 5
  else 0, replace = TRUE)),
  by = factor]

# subset
ss <- DT[rn %in% subsetrn[, rn]]
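One caveat worth noting about the %in% subset: since the sampling uses replace = TRUE, subsetrn can contain duplicate row numbers, and %in% keeps each matching row only once. A minimal illustration on toy data (not the DT above):

```r
library(data.table)

DT <- data.table(rn = 1:5, value = 10 * (1:5))
picks <- c(2L, 2L, 4L)      # as if sampled with replacement: row 2 drawn twice

nrow(DT[rn %in% picks])     # 2 rows: the duplicate draw is collapsed
nrow(DT[picks])             # 3 rows: multiplicity preserved
```

Indexing by the sampled numbers directly (possible here because rn coincides with the row position) keeps the intended multiplicities.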

EDIT: I've seen this way of quickly sampling from a data.table, but it doesn't do it by factor (https://stackoverflow.com/a/33201094/5252333). I've also seen tricks for sampling each factor in equal amounts, but not in different amounts per factor.
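For what it's worth, the equal-amounts trick generalises to unequal amounts with a named lookup vector; a sketch (the sizes in ns are assumed, and .SD subsetting still materialises a copy, so this may not be the fastest option):

```r
library(data.table)
set.seed(42)

DT <- data.table(rn = 1:99,
  factor = rep(letters[1:3], length.out = 99),
  value = rnorm(99, c(1, 5, 10)))

# assumed per-factor sample sizes, looked up by group name
ns <- c(a = 12, b = 20, c = 5)
ss <- DT[, .SD[sample(.N, ns[factor[1]], replace = TRUE)], by = factor]
ss[, .N, by = factor]       # one row per factor with the requested counts
```

This keeps the per-factor sizes in data rather than in an if/else chain.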

EDIT2: I've been playing around with the solution by @akrun but am still having trouble:

If I have one of the factors at 0, something is wrong:

# generate data.table
DT <- data.table(factor = letters[1:4],
  value = rnorm(300, c(1, 5, 10)))

# subset based on "row number" with various numbers per category

# index data table with numbers
id <- data.table(factor = letters[1:4], val = c(12, 20, 5, 0))
# map our DT onto it, then subsample by the new val
ssid <- DT[id, on = .(factor)][, sample(.I, val[1], replace = TRUE), factor]
# subset
ss <- DT[ssid[, V1]]
count(ss[, factor])  # count() from the plyr package
##   x freq
## 1 a   10
## 2 b   12
## 3 c    5
## 4 d   10

# this is wrong! It only works if I do it like this
ss <- DT[id, on = .(factor)][ssid[, V1]]

I would like to be able to get ssid such that I can just do DT[ssid[, V1]] (or DT[ssid]), so that I can work by reference instead of making a local copy of DT. In my application, this is all wrapped in a function that currently copies a small part of DT 50k times, taking ~25 minutes. The function performs some calculations with the subset and then returns the output. This is slow, and I'd like to figure out whether it's possible to do it by reference somehow.
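One possible way to get indices that point into DT itself, sketched here as a suggestion rather than a tested answer: add val to DT with an update join (which modifies DT by reference, no reordered copy involved), then sample .I grouped on DT directly. Factors with val = 0 then simply contribute no rows:

```r
library(data.table)
set.seed(1)

# same shape as the EDIT2 example, with an explicit 0 for factor "d"
DT <- data.table(factor = rep(letters[1:4], length.out = 300),
  value = rnorm(300, c(1, 5, 10)))
id <- data.table(factor = letters[1:4], val = c(12, 20, 5, 0))

# update join: adds val as a column of DT by reference
DT[id, on = .(factor), val := i.val]

# .I now indexes DT itself; val[1] == 0 yields an empty sample for "d"
ssid <- DT[, .I[sample(.N, val[1], replace = TRUE)], by = factor]
ss <- DT[ssid$V1]
ss[, .N, by = factor]       # a: 12, b: 20, c: 5, and no d
```

The helper column can be dropped afterwards with DT[, val := NULL].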

This question might become a bit too specific now ;-).

Japhir
  • Try `DT[data.table(factor = letters[1:3], val = c(12, 20, 5)), on = .(factor)][, sample(DT$rn, val[1], replace = TRUE), factor]` – akrun Oct 18 '17 at 12:05
  • @akrun: that returns a data.table with `factor` and the `rn` that it has sampled, so I'd still have to subset `DT` after that. So it's a nice replacement of all the `if (else)` statements, but I was wondering if this whole thing was possible in one step so I don't need the `rn` at all. – Japhir Oct 18 '17 at 12:19
  • There is no need for `rn`. You can use `.I` instead of `rn` – akrun Oct 18 '17 at 12:28
  • the difference between these two options is already noticeable: using microbenchmark on the two ways of getting `rn` 10k times gives me a mean of 95.43 nanoseconds for my if/else method and 31.47 nanoseconds for your `on =` method. – Japhir Oct 18 '17 at 12:31

1 Answer


We could join with a key/value dataset and use .I to sample:

DT[DT[data.table(factor = letters[1:3], val = c(12, 20, 5)), 
      on = .(factor)][, sample(.I, val[1], replace = TRUE), factor]$V1]

If we split this into parts:

data.table(factor = letters[1:3], val = c(12, 20, 5))

is a key/value data.table used to get 'val' as a column on the original dataset by joining on 'factor'.

In the second step, we do the join:

DT[data.table(factor = letters[1:3], val = c(12, 20, 5)), 
      on = .(factor)]

Now, we sample the row index, grouped by 'factor', specifying the size as the first element of 'val', extract the row-index column ($V1) and use it to subset the original dataset, i.e.

DT[....$V1]
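To make the middle step concrete, here is the grouped sampling on its own (a sketch with assumed data; note that .I here indexes the joined table, which only matches DT's own row numbers when DT is already grouped by 'factor' in the same order, the point the question's EDIT2 runs into):

```r
library(data.table)
set.seed(2)

DT <- data.table(rn = 1:99,
  factor = rep(letters[1:3], length.out = 99),
  value = rnorm(99))

# join the key/value table onto DT, then sample row indices per group
joined <- DT[data.table(factor = letters[1:3], val = c(12, 20, 5)),
             on = .(factor)]
idx <- joined[, sample(.I, val[1], replace = TRUE), factor]
idx[, .N, by = factor]      # a: 12, b: 20, c: 5
```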
akrun
  • Could you perhaps explain how this works and what the `on =` does? Also, do you know if setting a key beforehand could somehow increase the speed of the subset? EDIT: Found the [documentation (pdf)](https://cran.r-project.org/web/packages/data.table/data.table.pdf) as well as [this vignette](https://github.com/Rdatatable/data.table/blob/master/vignettes/datatable-secondary-indices-and-auto-indexing.Rmd#2-fast-subsetting-using-on-argument-and-secondary-indices) on `on`. Still, an explanation would be much appreciated. – Japhir Oct 18 '17 at 21:12
  • @Japhir Added more description into it. In this case, setting the key may not increase the speed, but you can try with benchmarks. – akrun Oct 19 '17 at 01:52
  • I just noticed a special case that I apparently have, where if you do not assign a number of observations to each factor in DT, it doesn't work. To fix this I had to `DT[indexDT, on = .(factor)][...$V1]`. – Japhir Oct 19 '17 at 17:06
  • @Japhir Sorry, I didn't get this special case. Are you saying that the number of rows is less? – akrun Oct 19 '17 at 17:08
  • No; in the MWE it would be as if the data.table it is joined onto contains factors a and b, but no c. I'm now thinking I should just set the val of c (and, in my real case, many others) to 0. That would solve the issue more elegantly. – Japhir Oct 19 '17 at 21:35
  • Yes, the number of rows is less. I'm making a subset of a big data.table based on a.n observations of "a", b.n of "b", etc. x.n < nrow(DT[factor == "x", ]) and sometimes x.n == 0, so we always end up with fewer rows. This means that the row-number references do not work in this case... even when I neatly set all factor values to 0. – Japhir Oct 20 '17 at 10:26
  • I've edited my question to show the problem. I hope you can understand what I'm trying to do ;). – Japhir Oct 21 '17 at 15:14
  • @Japhir Could you please post as a new question – akrun Oct 21 '17 at 16:20