1

I have a matrix with a column of different genes and a corresponding column of the -log(P-values) for each possible SNP for every gene.

So the matrix has 3 columns: Gene_lable, SNP and minus_logpval. I'm trying to write a code that identifies the SNP with the highest -log(P-value) for each gene. Here's the head(data):

  SNP           Gene_label           minus_logpval
1 rs3934834 HES4/ENSG00000188290       14.1031
2 rs3766193 HES4/ENSG00000188290        7.0203
3 rs3766192 HES4/ENSG00000188290       10.7420
4 rs3766191 HES4/ENSG00000188290       10.4323
5 rs9442371 HES4/ENSG00000188290       10.2941
6 rs9442372 HES4/ENSG00000188290        8.4235

This is the start of the code:

for(i in 1:254360) {
max_pval = 0
if(data$Gene_label[i]==data$Gene_label[i+1]) {
    x = array(NA, dim=c(0,2));
    x[i] = data$minus_logpval[i];
    x[i+1] = data$minus_logpval[i+1];
    temp = max(x);
    if (temp>max_pval) {
    max_pval=temp
    line = i
    }

But for some reason, R keeps giving me the error: Error in is.ordered(x) : argument "x" is missing, with no default. I didn't even use the is-ordered(x) function... I think the error's in the way I initialized x (which should be an array) but I don't know how to fix it.

zfz
  • 153
  • 3
  • 16
  • 1
    can you please make the code reproducible? – Michele Jun 18 '13 at 09:46
  • hi thanks for replying. not sure what you mean by reproducible, do you mean seeing the whole code? – zfz Jun 18 '13 at 09:52
  • @zfzhao read this post on how to make a [**small reproducible example**](http://stackoverflow.com/q/5963269/1478381). – Simon O'Hanlon Jun 18 '13 at 09:58
  • 1
    oh of course not! :-) just a dummy table that simulate your `data`. Reproducible means a code that, run from the first to the last line, is able to show the same error you get with your real data. Quite often trying to create such reproducible example you find the error by yourself. In this case, if the error is created while executing a line among those above (I guess), probably one of those functions is not receiving the expected the value. Having the possibility to see how `data` looks like would make it easier to help you. – Michele Jun 18 '13 at 10:03
  • hi, i edited my post to include the head(data) of the dataset im working with. i hope it makes it more clear. thanks for any help/advice! – zfz Jun 18 '13 at 11:33

2 Answers2

0

You can try it withou a loop using a tapply

tab <- expand.grid(gene=letters[1:2], SNP=LETTERS[1:3])
tab$minus_logpval <- abs(rnorm(6))*-1
tab <- tab[do.call("order", tab),]
tab$SNP <- as.character(tab$SNP)
with(tab, tapply(minus_logpval, gene, function(x) SNP[which.max(x)]))

HTH

droopy
  • 2,788
  • 1
  • 14
  • 12
0

A perfect use for ddply from plyr. Split the data.frame up into subsets (by Gene_label ) and operate on each piece (find the snp which relates to the max of minus_logpval ):

##  Reproducible example data
set.seed(1234)
df <- data.frame( Gene_label = rep( letters[1:3] , 3 ) , snp = rep( letters[5:7] , each = 3 ) , minus_logpval = rnorm(9) )
df
#  Gene_label snp minus_logpval
#1          a   e    -1.2070657
#2          b   e     0.2774292
#3          c   e     1.0844412
#4          a   f    -2.3456977
#5          b   f     0.4291247
#6          c   f     0.5060559
#7          a   g    -0.5747400
#8          b   g    -0.5466319
#9          c   g    -0.5644520

##  And a single line using 'ddply'
require(plyr)
ddply( df , .(Gene_label) , summarise , SNP = snp[which.max(minus_logpval)] )
#  Gene_label SNP
#1          a   g
#2          b   f
#3          c   e
Simon O'Hanlon
  • 58,647
  • 14
  • 142
  • 184
  • Hi, thanks so a lot. I'm new to R and coding in general, so I really appreciate all of your help. As you can see, I edited my original post to include the head(data). right now, I'm trying to work with: `df = data.frame(data) require(plyr) ddply( df , .(Gene_label) , summarise , SNP = snp[which.max(minus_logpval)] )` but I don't get any outputs from this code. I think the ddply function needs to be called to print the results? do you know how I can do that? thnx! – zfz Jun 18 '13 at 11:35
  • @zfzhao it should be `ddply( df , .(Gene_label) , summarise , SNP = SNP[which.max(minus_logpval)] )` - `snp` does not exist as a column in your data.frame (it is called `SNP`). Don't forget to assign the output to an object if you want to do something with it, e.g. `result <- ddply( df , .(Gene_label) , summarise , SNP = SNP[which.max(minus_logpval)] )` – Simon O'Hanlon Jun 18 '13 at 12:13