I am trying to randomly sample a data.table by group. The sample size of each group is computed by multiplying that group's frequency by Sample_Size, which is the expected number of rows in the output data.table.
I researched this topic on SO. Similar threads ("Need to randomly sample a data set with multiple groups each with multiple factors" and "take randomly sample based on groups") seem to assume uniform weights across groups, which doesn't work for me.
Here's test data:
InputDT <- data.table::data.table ("Country"=c(rep("A",20),rep("B",10),rep("C",5),rep("D",2)), "ID"=c(1:20,101:110,201:205,301:302))
The objective is to sample IDs by country.
Here's the frequency we want:
CountryFreq <- data.table::data.table("Country"=unique(InputDT$Country), "Freq"=c(4/10,2/10,2/10,2/10))
Here's the number of rows in the output data.table:
Sample_Size <- 10
As a rule, let's assume that Sample_Size < nrow(InputDT).
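To make the rule concrete, here is a minimal sketch of the computation described above: each country's quota is its frequency times Sample_Size, and that many IDs are drawn at random within the country. The names N_Sample, Merged and SketchDT are hypothetical, introduced only for illustration. The sketch assumes that rounding the quotas is acceptable and that no quota exceeds the number of rows available for its country; it is one possible way to express the idea, not a definitive solution.

```r
library(data.table)

# Test data, repeated here so the block runs on its own
InputDT <- data.table(
  Country = c(rep("A", 20), rep("B", 10), rep("C", 5), rep("D", 2)),
  ID      = c(1:20, 101:110, 201:205, 301:302)
)
CountryFreq <- data.table(
  Country = c("A", "B", "C", "D"),
  Freq    = c(4/10, 2/10, 2/10, 2/10)
)
Sample_Size <- 10

# Per-country quota: frequency times total sample size.
# Assumption: rounding is acceptable; with other frequencies the rounded
# quotas may not add up to Sample_Size exactly.
CountryFreq[, N_Sample := as.integer(round(Freq * Sample_Size))]

# Attach each row's quota, then draw that many rows within each country.
# Assumption: a quota never exceeds the rows available for that country.
Merged   <- merge(InputDT, CountryFreq[, .(Country, N_Sample)], by = "Country")
SketchDT <- Merged[, .SD[sample(.N, N_Sample[1])], by = Country][, .(Country, ID)]

print(SketchDT)
```

Calling set.seed() before the sampling step would make the draw reproducible.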
Here's manually created sample output:
OutputDT <- structure(list(Country = c("A", "A", "A", "A", "B", "B", "C",
"C", "D", "D"), ID = c(1, 5, 7, 3, 102, 109, 203, 204, 301, 302
)), .Names = c("Country", "ID"), row.names = c(NA, 10L), class = "data.frame")
Here's a test to check whether frequencies are as needed:
Hmisc::describe(OutputDT$Country)
OutputDT$Country
n missing distinct
10 0 4
Value A B C D
Frequency 4 2 2 2
Proportion 0.4 0.2 0.2 0.2
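The same check can also be sketched in base R, without Hmisc, using table() and prop.table(). The names Counts and Props are hypothetical, and the data frame simply repeats the manually created sample so the block runs on its own.

```r
# Manually created sample output, repeated from above
OutputDT <- data.frame(
  Country = c("A", "A", "A", "A", "B", "B", "C", "C", "D", "D"),
  ID      = c(1, 5, 7, 3, 102, 109, 203, 204, 301, 302)
)

# Row counts per country, then the share of each country in the sample
Counts <- table(OutputDT$Country)
Props  <- prop.table(Counts)

print(Counts)
print(Props)
```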
Can someone please help me? I've spent almost one day trying to learn sampling in R and then customizing it to my need. I'd appreciate any help.