I have a dataset of rows of genes each with their gene length, I am looking to sample from these genes by their gene length distribution using rejection sampling - as I have too many genes in this dataset that are too small to enter further analysis (but I can't just set a cut-off on my own to remove them). I have 1 dataset of genes with gene lengths to sample from, and another proposal distribution of gene lengths that I want to use in order to do the rejection sampling of the first dataset.
An example of my data looks like this:
#df1 data to sample from:
Gene Length
Gene1 5
Gene2 6
Gene3 400000
Gene4 1000
Gene5 25000
Gene6 10
Gene7 50
Gene8 4
Gene9 100
Gene10 2000
My proposal dataset:
#df2
Gene Length
Gene1 5000
Gene2 60000
Gene3 400000
Gene4 1000
Gene5 25000
Gene6 10000
Gene7 50000
Gene8 4000
Gene9 1000
Gene10 2000
I don't have any statistics background, and I am trying to perform rejection sampling (my overall aim is to get a sample of genes with less extremely small genes in length to take onto further analysis).
To do rejection sampling I am trying this from a tutorial I found here:
X = df1$Length
U = df2$Length
accept = c()
count = 1
pi_x <- function(x) {
new_x = (3/2)*(x^3)+(11/8)*(x^2)+(1/6)*(x)+(1/12)
return(new_x)
}
while(count <= 50 & length(accept) < 50){
test_u = U[count]
test_x = pi_x(X[count])/(3.125*dunif(X[count],0,1))
if (test_u <= test_x){
accept = rbind(accept, X[count])
count = count + 1
}
count = count + 1
}
My problem with this is that it only selects 25 genes (my ideal sampling range for further analysis is 50-100 genes being selected) and most of these 25 genes are still very small in size after sampling. Is it that I need to transform X
in some way before running this rejection sampling code? My actual data for df1
is 800 genes with a skewed/beta distribution of gene length (most are very small). Or am I completely missing something else in my understanding? Any guidance at all would be appreciated.
Input data:
df1 <- structure(list(Gene = c("Gene1", "Gene2", "Gene3", "Gene4", "Gene5",
"Gene6", "Gene7", "Gene8", "Gene9", "Gene10"), Length = c(5L,
6L, 400000L, 1000L, 25000L, 10L, 50L, 4L, 100L, 2000L)), row.names = c(NA,
-10L), class = c("data.table", "data.frame"))
df2 <- structure(list(Gene = c("Gene1", "Gene2", "Gene3", "Gene4", "Gene5",
"Gene6", "Gene7", "Gene8", "Gene9", "Gene10"), Length = c(5000L,
60000L, 400000L, 1000L, 25000L, 10000L, 50000L, 40000L, 1000L, 2000L)), row.names = c(NA,
-10L), class = c("data.table", "data.frame"))
Edit:
I have also tried:
sampled <- data.frame(proposal = df2$Length)
sampled$targetDensity <- dbeta(sampled$proposal, 3,6)
maxDens = max(sampled$targetDensity, na.rm = T)
sampled$accepted = ifelse(df1$Length < sampled$targetDensity / maxDens, TRUE, FALSE)
hist(sampled$proposal[sampled$accepted], freq = F, col = "grey", breaks = 100)
curve(dbeta(x, 3,6),0,1, add =T, col = "red")
But I am certain I am not using dbeta()
correctly since sampled$targetDensity
outputs all zeros - is there a way I can fix this? I have tried changing the values in dbeta()
but haven't had any success.