I have a data frame whose columns are different samples of an experiment. I want to find the correlation between each pair of samples: between samples V2 and V3, between V2 and V4, and so on. This is the data frame:

> head(t1)
          V2          V3          V4         V5         V6
1 0.12725011 0.051021886 0.106049328 0.09378767 0.17799444
2 0.86096784 1.263327211 3.073650624 0.75607466 0.92244361
3 0.45791031 0.520207274 1.526476608 0.67499102 0.49817761
4 0.00000000 0.001139721 0.003158557 0.00000000 0.00000000
5 0.13383965 0.098943019 0.099922146 0.13871867 0.09750611
6 0.01016334 0.010187671 0.025410170 0.00000000 0.02369374
> nrow(t1)
[1] 23367

If I run the cor function on this data frame to get the correlation between samples (columns), I get NA for every pair of samples:

> cor(t1, method = "spearman")
   V2 V3 V4 V5 V6
V2  1 NA NA NA NA
V3 NA  1 NA NA NA
V4 NA NA  1 NA NA
V5 NA NA NA  1 NA
V6 NA NA NA NA  1

But if I run this:

> cor.test(t1[,1],t1[,2], method="spearman")$estimate
rho 
0.92394 

the result is different. Why is this so? What is the correct way to get the correlation between these samples? Thank you in advance.

hora

1 Answer

Your data contains NA values.

From ?cor:

If use is "everything", NAs will propagate conceptually, i.e., a resulting value will be NA whenever one of its contributing observations is NA.
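
To make the effect of the use argument concrete, here is a minimal sketch on made-up data. The data frame toy is hypothetical and only mimics the shape of t1, with one row missing in every column. It compares the default use = "everything" with "complete.obs" and "pairwise.complete.obs".

```r
# Toy data (hypothetical, not the asker's t1): row 4 is NA in every column.
toy <- data.frame(
  V2 = c(0.13, 0.86, 0.46, NA, 0.14, 0.01),
  V3 = c(0.05, 1.26, 0.52, NA, 0.10, 0.02),
  V4 = c(0.11, 3.07, 1.53, NA, 0.09, 0.03)
)

# Default use = "everything": a single NA in a column makes every
# correlation involving that column NA.
r_all <- cor(toy, method = "spearman")

# Drop every row that contains an NA, then correlate the remaining rows.
r_complete <- cor(toy, method = "spearman", use = "complete.obs")

# For each pair of columns, drop only the rows where that pair has an NA.
r_pairwise <- cor(toy, method = "spearman", use = "pairwise.complete.obs")

print(r_all)
print(r_pairwise)
```

Which of the two non-default options is appropriate depends on whether every correlation should be based on the same set of rows ("complete.obs") or on as many rows as each pair allows ("pairwise.complete.obs").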

From ?cor.test

na.action: a function which indicates what should happen when the data contain NAs. Defaults to getOption("na.action").

On my system:

getOption("na.action")
[1] "na.omit"

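Use which(!is.finite(as.matrix(t1))) to search for problematic values (is.finite does not accept a data frame directly, hence the as.matrix) and which(is.na(t1)) to search for NA values. With the default Pearson method, cor returns NaN if you have Inf values in your data.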
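
As a rough illustration of that check, the sketch below uses a small made-up data frame (toy is hypothetical, not the asker's data) containing one NA and one Inf. It reports the position of each non-finite value and counts them per column.

```r
# Toy data (hypothetical): one NA in V2 and one Inf in V3.
toy <- data.frame(
  V2 = c(0.13, 0.86, NA, 0.00),
  V3 = c(0.05, Inf, 0.52, 0.00)
)

# is.finite() and is.infinite() need a matrix or vector, not a data frame.
m <- as.matrix(toy)

# Row and column index of every value that is NA, NaN, Inf or -Inf.
bad <- which(!is.finite(m), arr.ind = TRUE)
print(bad)

# Number of NA (including NaN) values and of infinite values per column.
na_per_col  <- colSums(is.na(m))
inf_per_col <- colSums(is.infinite(m))
print(na_per_col)
print(inf_per_col)
```

The arr.ind = TRUE argument makes which return row and column positions instead of a single linear index, which is easier to read on a data set with 23367 rows.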

Roland
  • How can I check whether my data frame contains NA? I think it includes plenty of Inf values as well. Will they also affect the result? Another question: I think cor.test is for pairwise correlation, and it needs two parameters for the calculation. I think what I should use is cor, not cor.test, but I am still not sure whether it is the correct function for finding the correlation between samples (columns) of a data frame. – hora Feb 02 '13 at 11:04
  • @hora See my edit to the answer and read the help pages. You can use `*apply` functions to do pairwise comparisons with `cor.test`. – Roland Feb 02 '13 at 11:13
  • Thanks, @Roland. I have now checked my data frame, and only one row has NA, so I would expect only the values related to that row to be NA, not all of them. I also replaced the Inf values, but the result is still NA. My question is why, when I use cor.test with the same data set comparing only two samples, the result is not NA, but when I use "cor" on the whole data frame I get NA. :( – hora Feb 02 '13 at 12:00
  • @hora we can throw guesses at this all day long. Show us the data or reproduce your problem with a simple reproducible example. – Roman Luštrik Feb 02 '13 at 12:10
  • @hora You are not correlating rows, but columns. Please try to understand what I have written and read the help pages carefully. – Roland Feb 02 '13 at 12:11
  • Sorry, Roland, I am a bit confused. Of course I am calculating for the columns; otherwise the row and column names in the output of the "cor" command above would not be the column names of the main data frame. I have now run the command with the "use" parameter, like this: cor(q1,method="spearman",use="pairwise.complete.obs"), and the correlations are no longer NA. Do you think it is the correct parameter to use? @Roman Luštrik, how should I provide the data? It is a very big matrix with 23367 rows. Is it possible to add files here? I cannot find a way! – hora Feb 02 '13 at 14:34
  • @hora you would generally upload a subset of your data to a third party site and link to it. Or you can make a mock example (see http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). – Roman Luštrik Feb 02 '13 at 16:02
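
Roland's suggestion in the comments of using `*apply` functions with `cor.test` could look roughly like the sketch below. It is only an illustration on made-up data: toy, pairs and rho are hypothetical names, and combn is just one way of enumerating the column pairs.

```r
# Toy data (hypothetical): row 4 is NA in every column.
toy <- data.frame(
  V2 = c(0.13, 0.86, 0.46, NA, 0.14, 0.01),
  V3 = c(0.05, 1.26, 0.52, NA, 0.10, 0.02),
  V4 = c(0.11, 3.07, 1.53, NA, 0.09, 0.03)
)

# One column per unordered pair of sample names.
pairs <- combn(names(toy), 2)

# Run cor.test on each pair and keep only the Spearman estimate.
rho <- apply(pairs, 2, function(p) {
  # exact = FALSE requests the asymptotic p-value, which avoids the
  # warning cor.test gives when the data contain ties.
  res <- cor.test(toy[[p[1]]], toy[[p[2]]],
                  method = "spearman", exact = FALSE)
  unname(res$estimate)
})
names(rho) <- apply(pairs, 2, paste, collapse = "-")
print(rho)
```

On this toy data the estimates match what cor(toy, method = "spearman", use = "pairwise.complete.obs") gives for the same pairs. The loop is mainly worth it when the p-values from cor.test are needed as well.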