2

I have a dataset of rows of genes each with their gene length, I am looking to sample from these genes by their gene length distribution using rejection sampling - as I have too many genes in this dataset that are too small to enter further analysis (but I can't just set a cut-off on my own to remove them). I have 1 dataset of genes with gene lengths to sample from, and another proposal distribution of gene lengths that I want to use in order to do the rejection sampling of the first dataset.

An example of my data looks like this:

#df1 data to sample from:
Gene  Length
Gene1  5
Gene2  6
Gene3  400000
Gene4  1000
Gene5  25000
Gene6  10
Gene7  50
Gene8  4
Gene9  100
Gene10 2000

My proposal dataset:

#df2
Gene  Length
Gene1  5000
Gene2  60000
Gene3  400000
Gene4  1000
Gene5  25000
Gene6  10000
Gene7  50000
Gene8  4000
Gene9  1000
Gene10 2000

I don't have any statistics background, and I am trying to perform rejection sampling (my overall aim is to get a sample of genes with less extremely small genes in length to take onto further analysis).

To do rejection sampling I am trying this from a tutorial I found here:

X = df1$Length
U = df2$Length

accept = c()
count = 1

pi_x <- function(x) {
  new_x = (3/2)*(x^3)+(11/8)*(x^2)+(1/6)*(x)+(1/12)
  return(new_x)
}


while(count <= 50 & length(accept) < 50){
  test_u = U[count]
  test_x = pi_x(X[count])/(3.125*dunif(X[count],0,1))
  if (test_u <= test_x){
    accept = rbind(accept, X[count])
    count = count + 1
  }
  count = count + 1
}

My problem with this is that it only selects 25 genes (my ideal sampling range for further analysis is 50-100 genes being selected) and most of these 25 genes are still very small in size after sampling. Is it that I need to transform X in some way before running this rejection sampling code? My actual data for df1 is 800 genes with a skewed/beta distribution of gene length (most are very small). Or am I completely missing something else in my understanding? Any guidance at all would be appreciated.

Input data:

df1 <- structure(list(Gene = c("Gene1", "Gene2", "Gene3", "Gene4", "Gene5", 
"Gene6", "Gene7", "Gene8", "Gene9", "Gene10"), Length = c(5L, 
6L, 400000L, 1000L, 25000L, 10L, 50L, 4L, 100L, 2000L)), row.names = c(NA, 
-10L), class = c("data.table", "data.frame"))

df2 <- structure(list(Gene = c("Gene1", "Gene2", "Gene3", "Gene4", "Gene5", 
"Gene6", "Gene7", "Gene8", "Gene9", "Gene10"), Length = c(5000L, 
60000L, 400000L, 1000L, 25000L, 10000L, 50000L, 40000L, 1000L, 2000L)), row.names = c(NA, 
-10L), class = c("data.table", "data.frame"))

Edit:

I have also tried:

sampled <- data.frame(proposal = df2$Length)
sampled$targetDensity <- dbeta(sampled$proposal, 3,6)

maxDens = max(sampled$targetDensity, na.rm = T)
sampled$accepted = ifelse(df1$Length < sampled$targetDensity / maxDens, TRUE, FALSE)

hist(sampled$proposal[sampled$accepted], freq = F, col = "grey", breaks = 100)
curve(dbeta(x, 3,6),0,1, add =T, col = "red")

But I am certain I am not using dbeta() correctly since sampled$targetDensity outputs all zeros - is there a way I can fix this? I have tried changing the values in dbeta() but haven't had any success.

DN1
  • 234
  • 1
  • 13
  • 38
  • What is your target distribution? In your example code you seem to use the density function from the tutorial, but I guess that cannot be what you wanted. If you would like to sample the genes based on their length (i.e. longer genes have higher probability to be chosen) then I would suggest to just use base-R's `sample()` function instead of trying to implement rejection sampling yourself (which would be an overkill in this case anyway.) – AEF May 26 '21 at 13:36

1 Answers1

0

If you know the number of genes you want to sample, the sample function should work fine:

sampled = sample(df$genes, size = n, prob = df$length) 

if you want to reduce even more the probability of sampling a gene with a small length, you can square the length for the prob argument. The argument prob is associating a sampling probability to each element (here based on the length)

If you don't know the number of genes you want to obtain, then you can define your own probability function:

rejection.prob = function(x){
  if (x<too.small) {return(0)} # all genes smaller than too.small won't be sampled
  if (x > it.is.ok) {return(1)} # all these ones will be sampled
  if (x>too.small & (x<it.is.ok){
    # here make any function that is equal to 0 when x == too.small
    # and 1 when x == it.is.ok
    # it can be a simple linear function
}

note that you can also use the output of the rejection.prob function for the sample function.

Depending on what you expect, you might want to have your rejection function to be more continuous (without these breaks at too.small and it.is.ok). If it's the case, I would use a sigmoid function where you can tune your parameters depending on the desired output.

glagla
  • 611
  • 4
  • 9
  • 1
    How does the custom function in your answer help if OP doesn't know the number of genes to obtain? I would argue that you cannot use any sampling method if you don't know how many samples you'd like to obtain. – AEF May 26 '21 at 15:59
  • 1
    Using the ```sample()``` function and squaring the gene length as the ```prob``` looks like it gives me a sample I can use (I have a rough idea the number of samples I want so I can set a number for now). My only question is just for my understanding of the stats - why does squaring reduce the chance of smaller genes? Is there a stats resource where I can go to teach myself if the answer is a simple one? – DN1 May 27 '21 at 09:34
  • @DN1 Squaring increases the ratio between the lengths, it has nothing to do with statistics. Example: length 90 is 10 times as large as length 9 (and so will be chosen 10 times as likely). But length 90^2=8100 is 100 times as large as 9^2 = 81 (and will as such be chosen 100 times as likely). – AEF May 27 '21 at 10:46
  • 3
    @DN1 Note that from a statistics perspective it makes absolutely no sense to create your data this way. If you have a target distribution that is known as meaningful from previous research, sampling is a good strategy. But just altering the weights (e.g. by squaring) until the result "looks good" is borderline to falsifying the data. It would be much better to just choose a certain cut-off if you want to remove genes. At least this way it is clear what happend and why it happend; even if the cut-off cannot be theoretically justified. – AEF May 27 '21 at 10:50