2

I am trying to randomly sample a data.table using groups. The sample size of each group will be computed by multiplying the frequency with Sample_Size, which is the expected number of rows in the output data.table.

I researched this topic on SO. It seems similar threads (Need to randomly sample a data set with multiple groups each with multiple factors and take randomly sample based on groups) have assumed uniform distribution for weights, which doesn't work for me.

Here's test data:

InputDT <- data.table::data.table ("Country"=c(rep("A",20),rep("B",10),rep("C",5),rep("D",2)), "ID"=c(1:20,101:110,201:205,301:302))

The objective is to sample IDs by country.

Here's the frequency we want:

CountryFreq <- 
 data.table::data.table("Country"=unique(InputDT$Country), "Freq"=c(4/10,2/10,2/10,2/10))

Here's the number of rows in the output data.table:

 Sample_Size <- 10

As a rule, let's assume that Sample_Size < nrows(InputDT)

Here's manually created sample output:

OutputDT <- structure(list(Country = c("A", "A", "A", "A", "B", "B", "C", 
"C", "D", "D"), ID = c(1, 5, 7, 3, 102, 109, 203, 204, 301, 302
)), .Names = c("Country", "ID"), row.names = c(NA, 10L), class = "data.frame")

Here's a test to check whether frequencies are as needed:

Hmisc::describe(OutputDT$Country)

OutputDT$Country 
       n  missing distinct 
      10        0        4 

Value        A   B   C   D
Frequency    4   2   2   2
Proportion 0.4 0.2 0.2 0.2

Can someone please help me? I've spent almost one day trying to learn sampling in R and then customizing it to my need. I'd appreciate any help.

watchtower
  • 4,140
  • 14
  • 50
  • 92

2 Answers2

1

We can do

InputDT[, rbindlist(Map(function(x, y) x[sample(seq_len(nrow(x)), y)], 
        split(.SD, Country), freq))]

data

freq <- c(4, 2, 2, 2)
akrun
  • 874,273
  • 37
  • 540
  • 662
  • Thanks for your help. I believe this code will work only if `InputDT` is sorted as per `freq`, meaning there has to be 1-1 mapping between `freq` and `Country` split. Correct? – watchtower May 15 '18 at 04:57
  • 1
    @watchtower I assume that the freq is in the same order – akrun May 15 '18 at 12:46
1

You can join InputDT with country frequency first before doing the sampling for each country as follows:

InputDT[CountryFreq,
    .SD[sample(.N, min(.N, Freq*Sample_Size))], 
    by=.EACHI,
    on=.(Country)]

Notes:

InputDT[i=CountryFreq, on=.(Country)] joins CountryFreq with InputDT using Country as the key.

by=.EACHI performs this j=.SD[sample(.N, min(.N, Freq*Sample_Size))] for each row in i=CountryFreq. Note by=.EACHI only works for an equi-join for now.

.SD is Subset of Data of InputDT, i.e. each subset of data from InputDT for each Country in each row of i because of by=.EACHI. .SD is only really within the scope of InputDT and can only be used in j. See ?data.table. To learn more, check out lexical scoping. A good reference is Advanced R by Hadley Wickham.

sample(.N, min(.N, Freq*Sample_Size)) samples Freq*Sample_Size indices from the number of rows from .SD while min ensures that you dont sample more than available samples in that country.

Finally, .SD[sample(.N, min(.N, Freq*Sample_Size))] subset sampled rows of .SD.


edit: show sample runs from R console.

> InputDT[CountryFreq,
+     .SD[sample(.N, min(.N, Freq*Sample_Size))], 
+     by=.EACHI,
+     on=.(Country)]
    Country  ID
 1:       A  19
 2:       A   7
 3:       A   5
 4:       A   3
 5:       B 109
 6:       B 110
 7:       C 203
 8:       C 205
 9:       D 302
10:       D 301
> InputDT[CountryFreq,
+     .SD[sample(.N, min(.N, Freq*Sample_Size))], 
+     by=.EACHI,
+     on=.(Country)]
    Country  ID
 1:       A  12
 2:       A  19
 3:       A  17
 4:       A  10
 5:       B 110
 6:       B 105
 7:       C 202
 8:       C 203
 9:       D 302
10:       D 301
> InputDT[CountryFreq,
+     .SD[sample(.N, min(.N, Freq*Sample_Size))], 
+     by=.EACHI,
+     on=.(Country)]
    Country  ID
 1:       A   9
 2:       A   7
 3:       A  19
 4:       A   6
 5:       B 106
 6:       B 108
 7:       C 205
 8:       C 201
 9:       D 302
10:       D 301
chinsoon12
  • 25,005
  • 4
  • 25
  • 35
  • Thanks, but I am unable to get desired output. IDs from Country A are randomly selected instead of Country B, C and D. Not sure why. Could you please help? It always helps to learn different ways to do the same thing. Thanks in advance. – watchtower May 16 '18 at 03:23
  • i am getting different values for A, B and C when i rerun the code. D has only 2 samples, so sampling 2 out of 2 will always return the same – chinsoon12 May 16 '18 at 03:28
  • Thanks chinsoon. I am getting values from A for B, C and D. B should have values in 1xx range, C should have 2xx range, D should have 3xx range, but I am getting all values in xx range, which belongs to A. If you run akrun's code below, you will the difference. akrun's code is working flawlessly. – watchtower May 16 '18 at 04:19
  • 1
    added some sample runs. i think its because the code was missing `.SD` earlier today – chinsoon12 May 16 '18 at 05:28
  • Thank you so much. Wonderful. Could you please explain your code, if you don't mind? This way, readers can understand what's going on. Thanks in advance. – watchtower May 17 '18 at 02:59