
I have spent some time trying to figure out what happens when randomly sampling rows from a data.table by group (as here, here or here), and got stuck on sampling within `i`.

Starting with some setup:

library(data.table)
library(magrittr)

seed <- 2016

set.seed(seed)
size <- 10
dt <- data.table(
  id = 1:size, A = sample(letters[1:3], size, replace = TRUE), B = 'N',
  C = sample(1:100, size, replace = TRUE) + sample(30:70, size, replace = TRUE))

Check which rows should be sampled:

set.seed(seed)
dt[, .N, by = A]
dt[, .N, by = A][, N] %>% sapply(function(x) { sample(x, round(x*0.5)) })
#    A N
# 1: a 6
# 2: c 2
# 3: b 2
#
# Which gives the following rows:
# a: 2, 1, 4
# c: 1
# b: 1
# 
# So the result should be:
#     id A B   C | order in sampled dt
#  1:  1 a N  82 | 2
#  2:  2 a N  86 | 1
#  3:  3 c N  68 | 4
#  4:  4 a N 140 | 
#  5:  5 b N  92 | 5
#  6:  6 a N  94 | 3
#  7:  7 b N 102 | 
#  8:  8 c N  69 | 
#  9:  9 a N 126 | 
# 10: 10 a N  56 | 

To confirm: is the order of A the order in which its values appear in dt?
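A quick way to check the grouping order, as a minimal sketch: to my knowledge `by` returns groups in order of first appearance, while `keyby` sorts them. `dt_demo` is a made-up table for illustration, not the `dt` defined above.

```r
library(data.table)

# Hypothetical table whose groups first appear in the order c, a, b
dt_demo <- data.table(A = c("c", "a", "b", "a", "c"), x = 1:5)

# by: groups come back in order of first appearance
by_order <- dt_demo[, .N, by = A]$A

# keyby: groups come back sorted (and the result is keyed on A)
keyby_order <- dt_demo[, .N, keyby = A]$A

by_order     # "c" "a" "b"
keyby_order  # "a" "b" "c"
```

This matches the `a, c, b` order in the `.N` output above, where `a` first appears at id 1, `c` at id 3 and `b` at id 5.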

Some "correct" samplings:

# Results as below or just columns A and I (or V1) with ids:
#    id A B  C
# 1:  2 a N 86
# 2:  1 a N 82
# 3:  6 a N 94
# 4:  3 c N 68
# 5:  5 b N 92

# Get .I and sample from them:
set.seed(seed)
dt[, .I, by = A] %>%
  .[, .SD[sample(.N, round(.N*0.5))], by = A]
set.seed(seed)
dt[, .I[sample(.N, round(.N*0.5))], by = A]
# Both returning expected ids

# Sample from .SD
set.seed(seed)
dt[, .SD[sample(.N, round(.N*0.5))], by = A]
# Correct, but materialises .SD for every group, so it can be slow

# Sample from .I and use in i (to e.g. change some values in j)
set.seed(seed)
dt[dt[, .I[sample(.N, round(.N*0.5))], by = A]$V1, ]
set.seed(seed)
dt[dt[, sample(.I, round(.N*0.5)), by = A]$V1, ]
# Correct and faster than above
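One caveat worth sketching for the two forms above: `sample(x, k)` treats a length-one numeric `x` as `1:x`, so `sample(.I, k)` can draw a wrong row number when a group has exactly one row, whereas `.I[sample(.N, k)]` cannot. The example below is an illustration under assumptions: `dt_demo` is hypothetical, and `ceiling()` replaces the question's `round()` only so that the single-row group is actually sampled (`round(0.5)` is `0` in R).

```r
library(data.table)

set.seed(1)  # arbitrary seed, not the one used in the question
# Hypothetical table: group "c" has a single row, stored at position 10
dt_demo <- data.table(id = 1:10, A = rep(c("a", "b", "c"), c(6, 3, 1)), C = 0)

# .I[sample(.N, k)] indexes into the group's own row numbers, so a
# single-row group can only ever return its own position
idx <- dt_demo[, .I[sample(.N, ceiling(.N * 0.5))], by = A]$V1

# Use the sampled positions in i to update only those rows by reference
dt_demo[idx, C := 1]

# Number of sampled rows per group
per_group <- dt_demo[, .(sampled = sum(C)), by = A]
per_group
```

Here `idx` always contains 10 for group `c`; with `sample(.I, 1)` that group would instead draw from `1:10`.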

The question: what are the following calls sampling from?

set.seed(seed)
dt[sample(.N, round(.N*0.5)), .I, by = A]
#    A  I
# 1: a  2
# 2: a 10
# 3: a  1
# 4: b  7
# 5: c  3

set.seed(seed)
dt[sample(.N, round(.N*0.5)), .SD, by = A]
#    id A B   C
# 1:  2 a N  86
# 2: 10 a N  56
# 3:  1 a N  82
# 4:  7 b N 102
# 5:  3 c N  68

This is clearly related to sampling within i, but I cannot figure out what exactly is happening.

m-dz
  • `.N` is the number of rows within `dt`. `sample(.N, round(.N*0.5))` is sampling half of the rows out of `dt`. `.I` is showing which rows were sampled. `by = A` is forcing `data.table` to take into account the *original locations* of the rows within the *remaining* groups in `A` (*after sampling*). In other words, if you do `dt[sample(.N, round(.N*0.5)), .I]` it will just show you the remaining rows. In the second case `by = A` (and `.SD`) is doing nothing more than ordering and you will get the same (unsorted) result with just `dt[sample(.N, round(.N*0.5))]`. – David Arenburg Feb 12 '17 at 14:13
  • Thanks! To clarify: `dt[sample(.N, round(.N*0.5))]` is sampling half of the rows, then `dt[sample(.N, round(.N*0.5)), .I]` is returning `1:5`, i.e. row numbers from the result of sampling, whereas `dt[sample(.N, round(.N*0.5)), .I, by = A]` gives the correct row positions from the original `dt`. But if I use `.SD` this is no longer true and returned rows are always drawn from the original `dt`, whether `by = A` is used or not. – m-dz Feb 12 '17 at 14:40
  • So the confusion comes from `.N` being used in `i`, where it always returns the number of rows of the whole `dt`, not the number per group as in `j`. – m-dz Feb 12 '17 at 14:42
  • Yes, because (apparently) `by = A` stores some information regarding the `A` group *prior to the sampling* (although `by` always comes after the `i`) which `.I` can use. I guess it comes in useful, but I'm too lazy to search the docs to see if this is actually documented, as this indeed seems a bit counter-intuitive to me. Regarding your second comment, I see no confusion. The `i` part is *always* executed before the `j` and `by`. Hence, it is expected that `.N` within `i` will ignore `by`. The order usually is `i -> by -> do stuff in j within the i subset and within each group in by`. – David Arenburg Feb 12 '17 at 14:43
  • I should have written "my confusion"; I assumed `by` would be evaluated first. Now it is clear, thanks again. Re. the documentation, I tried to look for this issue, without much success unfortunately. – m-dz Feb 12 '17 at 15:27
  • If by "this issue" you mean that `i` is evaluated before `by`, the doc at `?data.table` opens by directing to the first vignette, which covers this in section f. Also, the Details section of the doc starting "The general form ..." covers it. – Frank Feb 12 '17 at 16:01
  • I was thinking more about `dt[sample(.N, round(.N*0.5)), .I, by = A]` returning row positions from the "original" `dt`, not the sampled one. – m-dz Feb 12 '17 at 16:12
  • Oh ok. There's an open issue related to that: https://github.com/Rdatatable/data.table/issues/1494 – Frank Feb 12 '17 at 19:03
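The evaluation order described in the comments can be sketched as follows: `.N` inside `i` is the row count of the whole table, so the one-step call should behave like subsetting first and grouping afterwards. This is an illustration under assumptions: `dt_demo` and its explicit `row` column are hypothetical, and the `row` column is used to track original positions instead of relying on how `.I` behaves when `i` and `by` are combined (the subject of the linked issue).

```r
library(data.table)

# Hypothetical table; `row` records each row's original position explicitly
dt_demo <- data.table(id = 101:110, A = rep(c("a", "b", "c"), c(5, 3, 2)))
dt_demo[, row := .I]

# One step: .N in i is nrow(dt_demo), so half of ALL rows are drawn,
# and only then are the surviving rows grouped by A
set.seed(1)  # arbitrary seed
one_step <- dt_demo[sample(.N, round(.N * 0.5)), .SD, by = A]

# Two steps: the same subset made explicitly, then grouped
set.seed(1)
rows <- sample(nrow(dt_demo), round(nrow(dt_demo) * 0.5))
two_step <- dt_demo[rows][, .SD, by = A]

# Both routes keep the same original rows in the same order
same_rows <- identical(one_step$row, two_step$row)
same_rows
```

Because the draw happens before grouping, the number of rows kept per group is not fixed at half of each group. Sampling within each group needs `.N` in `j`, as in the `dt[, .I[sample(.N, round(.N*0.5))], by = A]` form from the question.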
