
I would like to efficiently make a random sample by group from a data.table, but it should be possible to sample a different proportion for each group.

If I wanted to sample the same fraction, sampling_fraction, from each group, I could take inspiration from this question and its related answer and do something like:

library(data.table)
DT = data.table(a = sample(1:2), b = sample(1:1000,20))

group_sampler <- function(data, group_col, sample_fraction){
  # this function samples sample_fraction <0,1> from each group in the data.table
  # inputs:
  #   data - data.table
  #   group_col - column(s) used to group by
  #   sample_fraction - a value between 0 and 1 indicating what % of each group should be sampled
  data[,.SD[sample(.N, ceiling(.N*sample_fraction))],by = eval(group_col)]
}

# what % of data should be sampled
sampling_fraction = 0.5

# perform the sampling
sampled_dt <- group_sampler(DT, 'a', sampling_fraction)

But what if I wanted to sample 10% from group 1 and 50% from group 2?

ira
  • How do you define which is group 1 and which is group 2? – s_baldur Oct 15 '19 at 13:29
  • In the example above, the column 'a' has values 1 and 2, thus group 1 and group 2. To make sure that the correct sampling fraction is assigned to each group, it might be possible to pass a named vector or something similar as an input to the function. I am just not exactly sure how to do it. – ira Oct 16 '19 at 07:33

2 Answers


You can use .GRP to index into a vector of fractions, but to ensure that each group is matched to the correct fraction, you might want to define group_col as a factor variable.

group_sampler <- function(data, group_col, sample_fractions) {
  # this function samples a group-specific fraction from each group in the data.table
  # inputs:
  #   data - data.table
  #   group_col - column used to group by
  #   sample_fractions - vector of values between 0 and 1, one per group, in sorted group order
  stopifnot(length(sample_fractions) == uniqueN(data[[group_col]]))
  data[, .SD[sample(.N, ceiling(.N*sample_fractions[.GRP]))], keyby = group_col]
}
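
A quick way to check the ordering concern: with keyby, groups are processed in sorted order of the grouping column, so .GRP is 1 for the smallest group value regardless of where it appears in the table. The sketch below uses toy data and a hypothetical `fractions` vector (both are assumptions for illustration) to show which fraction each group would pick up:

```r
library(data.table)

# Toy data: group 2 appears first in the table, group 1 second
DT <- data.table(a = rep(c(2L, 1L), each = 10), b = 1:20)

# Hypothetical fractions, given in sorted group order: a == 1 first, then a == 2
fractions <- c(0.1, 0.5)

# keyby sorts the groups, so .GRP == 1 corresponds to a == 1
check <- DT[, .(grp = .GRP,
                fraction = fractions[.GRP],
                n_sampled = ceiling(.N * fractions[.GRP])),
            keyby = a]
print(check)
```

If the fractions were supplied in order of appearance instead (group 2 first), each group would silently receive the other group's fraction, which is the risk the name-based lookup avoids.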

Edit in response to chinsoon12's comment:

It would be safer (instead of relying on correct order) to have the last line of the function:

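data[, .SD[sample(.N, ceiling(.N*sample_fractions[[as.character(.BY[[1]])]]))], keyby = group_col]

And then you pass sample_fractions as a vector named by group value (the as.character() call makes the lookup go by name even when the group values are numeric, as they are in DT):

group_sampler(DT, 'a', sample_fractions = c("1" = 0.1, "2" = 0.9))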
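
As a self-contained illustration of the name-based lookup, here is a minimal sketch. The function name `sample_by_name`, the toy data, and the up-front name check are assumptions added for illustration, not part of the answer above:

```r
library(data.table)

set.seed(42)
DT <- data.table(a = rep(c("x", "y"), each = 10), b = 1:20)

# Hypothetical helper: look up each group's fraction by name, not by position
sample_by_name <- function(data, group_col, sample_fractions) {
  groups <- unique(as.character(data[[group_col]]))
  # Fail early if the vector is unnamed or any group has no fraction assigned
  stopifnot(!is.null(names(sample_fractions)),
            all(groups %in% names(sample_fractions)))
  data[, .SD[sample(.N, ceiling(.N * sample_fractions[[as.character(.BY[[1]])]]))],
       by = group_col]
}

# The order of the named vector does not matter
res <- sample_by_name(DT, "a", c(y = 0.5, x = 0.1))
print(res[, .N, keyby = a])
```

With 10 rows per group, this should return 1 row for group x and 5 rows for group y, whichever order the fractions are passed in.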
s_baldur

Here's an option which uses a lookup table (and so doesn't rely on the ordering of vectors or groups).

library(data.table)
DT = data.table(group = sample(1:2), val = sample(1:1000,20))

sample_props <- data.table(group = 1:2, prop = c(.1,.5))

group_sampler <- function(data, group_col, sample_props){
  # this function samples a group-specific proportion from each group in the data.table
  # inputs:
  #   data - data.table with data
  #   group_col - column(s) used to group by (must be in both data.tables)
  #   sample_props - data.table with sample proportions (column prop)
  # attach each row's proportion via the lookup table
  ret <- merge(data, sample_props, by = group_col)
  # prop is constant within a group, so its first value determines the sample size
  ret <- ret[,.SD[sample(.N, ceiling(.N*prop[1L]))], eval(group_col)]
  return(ret[,prop := NULL][])
}

# perform the sampling
group_sampler(DT, 'group', sample_props)
#>    group val
#> 1:     1 721
#> 2:     2 542
#> 3:     2 680
#> 4:     2 613
#> 5:     2 170
#> 6:     2 175

Created on 2019-10-15 by the reprex package (v0.3.0)
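
Since the question asks for efficiency, a variant of the lookup-table idea is sketched below. It swaps in a different, commonly used data.table idiom: collect the sampled row numbers with .I and subset once, rather than building .SD for every group. The toy data and the assumption that sample_props has exactly one row per group are mine; a group missing from sample_props would get an NA proportion and make sample() fail.

```r
library(data.table)

set.seed(1)
DT <- data.table(group = rep(1:2, each = 10), val = 1:20)
sample_props <- data.table(group = 1:2, prop = c(0.1, 0.5))

# Attach each row's proportion with a join; the result keeps DT's row order
# and leaves DT itself unmodified
with_prop <- sample_props[DT, on = "group"]

# Collect the sampled row numbers per group, then subset DT once
idx <- with_prop[, .I[sample(.N, ceiling(.N * prop[1L]))], by = group]$V1
sampled <- DT[idx]
print(sampled[, .N, keyby = group])
```

Whether this is noticeably faster than the .SD version depends on the number of groups and columns, so it is worth timing on your own data.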

ClancyStats