1

I have two vectors of correlations: one which represents real correlations, and the other permuted correlations (the null distribution). I want to find the correlation value that corresponds to a FDR of 0.05.

Updated approach:

cor_real=rnorm(1000,0,sd=0.2)
cor_null=rnorm(1000,0,sd=0.15)

d_real=density(cor_real,from=max(min(cor_real),min(cor_null)),to=min(max(cor_real),max(cor_null)))
d_null=density(cor_null,from=max(min(cor_real),min(cor_null)),to=min(max(cor_real),max(cor_null)))
# here we ensure that the x values are comparable between the two densities

plot(d_real)
lines(d_null)

enter image description here

Then, to find the correlation value that corresponds to FDR = 0.05, my guess would be:

ratios=d_null$y/d_real$y
d_real$x[which(round(ratios,2)==.05)]
[1] 0.5694628 0.5716372 0.5868581 0.5890325 0.5912069
# this the the correlation value(s) that corresponds to a 5% chance of a false positive

Is this the right approach?


E.g.:

cor_real=rnorm(100,0.25,sd=0.1)
cor_null=rnorm(100,0.2,sd=0.1)

h_real=hist(cor_real,plot=F)
h_null=hist(cor_null,plot=F)

plot(h_null,col=rgb(1,0,0,.5),xlim=c(0,1),ylim=c(0,max(h_real$counts))) # in red
plot(h_real,col=rgb(0,.5,.5,0.25),add=T) # in blue

Real = blue, null = red

I think this is when the ratio of the frequencies of the two histograms = 0.05 (null:real), but I'm not 100% sure about that.

How might I find the correlation value that corresponds to FDR = 0.05, having "access" to both the null and real distributions?

  • 2
    Pleaes read the information at the top of the [tag:r] tag page on how to ask a question. In particular please provide a reproducible example. – G. Grothendieck Dec 22 '20 at 19:13
  • Fair enough, but I don't have a reproducible example to provide because I'm not sure how to do the operation to begin with. – Rebecca Eliscu Dec 22 '20 at 19:14
  • 2
    Well, you could at least provide two vectors of correlations just to make it clear what your input data looks like. I'm not sure what ratio you are talking about. What's the numerator and denominator here. If you have no data or code, it might be better to ask on [stats.se] where purely statistical questions are on topic. – MrFlick Dec 22 '20 at 19:17
  • 1
    Understood, updated. – Rebecca Eliscu Dec 22 '20 at 19:37
  • You are using an absolute frequency histogram, when the value you are looking for is relative, as it is FDR= 0.05; you could change `h_real` and `h_null` from absolute to relative frequency histogram and then it would be possible to get the cor_null value at which h_real and h_null would be the same... – AlSub Dec 22 '20 at 20:00
  • Would it be appropriate to calculate density, then, of cor_real and cor_null, and take the ratios of the densities such that d_null$y/d_real$y=0.05? I'm not sure if this is equivalent to relative frequency. I've updated my question to reflect this approach. – Rebecca Eliscu Dec 22 '20 at 20:04

1 Answers1

1

Density is not quite correct because 1. you did not set n and from, to to be the same, 2. it calculates the number of false positive / number of false negative at only 1 bin.

False discovery rate is defined as FP / (FP + TP). See this post too. We can calculate this once we put the two correlations in the same vector, label and order them:

set.seed(321)
cor_real=rnorm(1000,0,sd=0.2)
cor_null=rnorm(1000,0,sd=0.15)

df = data.frame(rho = c(cor_real,cor_null),
                type = rep(c(TRUE,FALSE),each=1000))
df$rho = abs(df$rho)
df = df[order(df$rho,decreasing=TRUE),]

df$FP = cumsum(df$type == FALSE)
df$TP = cumsum(df$type == TRUE)

df$FDR = df$FP / (df$FP + df$TP)

If you look at the results,

head(df,25)
           rho  type FP TP        FDR
366  0.5822139  TRUE  0  1 0.00000000
247  0.5632078  TRUE  0  2 0.00000000
298  0.5594879  TRUE  0  3 0.00000000
147  0.5460875  TRUE  0  4 0.00000000
781  0.5373146  TRUE  0  5 0.00000000
760  0.5367116  TRUE  0  6 0.00000000
797  0.5216281  TRUE  0  7 0.00000000
569  0.5204598  TRUE  0  8 0.00000000
374  0.5200687  TRUE  0  9 0.00000000
744  0.5101275  TRUE  0 10 0.00000000
864  0.5058457  TRUE  0 11 0.00000000
227  0.4997959  TRUE  0 12 0.00000000
66   0.4993164  TRUE  0 13 0.00000000
14   0.4886520  TRUE  0 14 0.00000000
830  0.4840573  TRUE  0 15 0.00000000
261  0.4765394  TRUE  0 16 0.00000000
1163 0.4703764 FALSE  1 16 0.05882353
27   0.4661862  TRUE  1 17 0.05555556
965  0.4633883  TRUE  1 18 0.05263158
530  0.4608271  TRUE  1 19 0.05000000
96   0.4575683  TRUE  1 20 0.04761905
851  0.4563224  TRUE  1 21 0.04545455
922  0.4516161  TRUE  1 22 0.04347826
343  0.4511517  TRUE  1 23 0.04166667

At abs(rho) >= 0.4511517, you have 1 FP and 23 TP, giving you an FDR of 0.0416.. which is below the FDR of 0.05. So you can set your absolute cutoff here.

The example you have is pretty hard to test because both are almost the same null hypothesis with only a different sd. In real life most likely we would need to simulate data to find the correlation we obtain under the null hypothesis. And you will see that the calculation above should work pretty well.

StupidWolf
  • 45,075
  • 17
  • 40
  • 72
  • This is extremely helpful, thank you! The real correlations, I will add, are based on a gene expression matrix and the null comes from permuting values in each sample and recalculating correlation x 100. – Rebecca Eliscu Dec 22 '20 at 23:13
  • I'm looking again at this example you've provided -- would I not set the cutoff for FDR <= 0.05 at abs(rho)=0.4765394 since the next value corresponds to FDR > 0.05? – Rebecca Eliscu Dec 23 '20 at 05:34
  • you would like to include as many as possible. thats the purpose of false discovery. normally it would see the fdr decrease as you increase the threshold.. however given the nature of this data etc.. we see something like this – StupidWolf Dec 23 '20 at 09:06
  • Understood. And some (more) followup q's: 1) Do the null and real distribution sample sizes need to be the same to calculate the FDR as you've described? 2) Generally, I'm having a difficult time conceptually grasping how the correlation value when ratio of the densities (null/real) = 0.05 is different from the correlation value I would obtain using what you've shown here. Would you be able to elaborate on this at all? Many thanks for your help! – Rebecca Eliscu Dec 24 '20 at 06:11
  • @RebeccaEliscu i think those questions you asked are more suitable for cross-validated. https://stats.stackexchange.com/ this is stackoverflow. I have addressed your original question which is to calculate the fdr in R – StupidWolf Dec 24 '20 at 16:08