
I am trying to write a function that aggregates or subsets a data frame by a particular column, and then computes the proportion of values in another column of that data frame that fall within certain ranges.

Specifically, the relevant parts of my data frame, allmutations, look like this:

gennumber   sel
1          -0.00351647088810292
1           0.000728499401888683
1           0.0354633950503043
1           0.000209700229276244
2           6.42307549736376e-05
2          -0.0497259605114181
2          -0.000371856995145525

Within each generation (gennumber), I would like to compute the proportion of values in “sel” that are greater than 0.001, between -0.001 and 0.001, and less than -0.001. Over the entire data set, I've just been doing this:

ben <- allmutations$sel > 0.001        # this is for all generations
bencount <- length(which(ben == TRUE))
totalmu <- length(ben)                 # length(ben) = total number of mutants
tot.pben <- bencount/totalmu           # proportion

What is the best way to do that operation for each value in gennumber? Also, is there an easy way to get proportion of values in the range -0.001 < sel < 0.001? I couldn't figure out how to do it, so I “cheated” and took an absolute value of the column and just looked for values less than 0.001. I can't help but feel there must be a better way though.
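
For reference, the absolute-value version I used looks roughly like this (a sketch; absel and tot.pmid are just illustrative names):

absel <- abs(allmutations$sel) < 0.001   # |sel| < 0.001, i.e. -0.001 < sel < 0.001
tot.pmid <- sum(absel)/length(absel)     # proportion in the middle band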

Thanks for any help you can give, and please let me know if I can provide any clarification.

dput() of data:

structure(list(gennumber = c(1L, 1L, 1L, 1L, 2L, 2L, 2L), sel = c(-0.00351647088810292, 
0.000728499401888683, 0.0354633950503043, 0.000209700229276244, 
6.42307549736376e-05, -0.0497259605114181, -0.000371856995145525
)), .Names = c("gennumber", "sel"), class = "data.frame", row.names = c(NA, 
-7L))
LiY

2 Answers


You can combine two logical tests with &, so to test -0.001 < sel < 0.001 you can write sel > -0.001 & sel < 0.001.
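
For example, on a small vector (a quick illustration; note that & works element-wise, while && compares only single logical values, which is why it fails on a whole column):

sel <- c(-0.0035, 0.0007, 0.0355, 0.0002)
sel > -0.001 & sel < 0.001
## [1] FALSE  TRUE FALSE  TRUE
## sel > -0.001 && sel < 0.001 would not test the whole vector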

Here is a way using plyr:

dat <- read.table(tc <- textConnection("
gennumber sel
1 -0.00351647088810292
1 0.000728499401888683
1 0.0354633950503043
1 0.000209700229276244
2 6.42307549736376e-05
2 -0.0497259605114181
2 -0.000371856995145525"), header = TRUE); close(tc)

library("plyr")

ddply(dat,.(gennumber),summarize,
    `sel < -0.001` = sum(sel < -0.001)/length(sel),
    `-0.001 < sel < 0.001` = sum(sel > -0.001 & sel < 0.001)/length(sel),
    `0.001 < sel` = sum(sel > 0.001)/length(sel))
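
On the sample data this should produce something like:

  gennumber sel < -0.001 -0.001 < sel < 0.001 0.001 < sel
1         1    0.2500000            0.5000000        0.25
2         2    0.3333333            0.6666667        0.00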
Sacha Epskamp
  • Thank you for the suggestions. That was silly: I was using && instead of &. I didn't realize that R made a distinction. – LiY Apr 13 '11 at 20:30

For the first part, assuming your data are in dat, we first split the data by gennumber:

sdat <- with(dat, split(dat, gennumber))
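
sdat is now an ordinary list containing one data frame per generation, named by the values of gennumber:

> names(sdat)
[1] "1" "2"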

then we write a custom function to do the comparison you want

foo <- function(x, cutoff = 0.001) {
    ## proportion of values in the second column (sel) above the cutoff
    sum(x[,2] > cutoff) / length(x[,2])
}

and sapply() it over the individual chunks of data in sdat

sapply(sdat, foo)

Which gives:

> sapply(sdat, foo)
   1    2 
0.25 0.00

for this sample of data.

For the second part, we can extend the above function foo() to accept an upper and lower limit and do the computation:

bar <- function(x, upr, lwr) {
    ## proportion of values in the second column strictly between lwr and upr
    sum(lwr < x[,2] & x[,2] < upr) / length(x[,2])
}

Which gives (note how the extra arguments are passed through to bar() via sapply()):

> sapply(sdat, bar, lwr = -0.001, upr = 0.001)
        1         2 
0.5000000 0.6666667
Gavin Simpson
  • when x is a logical vector and you need the proportion of TRUE values, mean(x) is more elegant (but a bit slower) than sum(x)/length(x); see the sketch after these comments. – Thierry Apr 11 '11 at 20:39
  • Thanks for the help and the explanations! That worked very well. – LiY Apr 13 '11 at 20:31
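
Following up on Thierry's comment, the two helpers could be rewritten with mean(), since mean() of a logical vector is the proportion of TRUE values (a sketch; foo2 and bar2 are illustrative names, not part of the answers above):

foo2 <- function(x, cutoff = 0.001) mean(x[,2] > cutoff)
bar2 <- function(x, upr, lwr) mean(lwr < x[,2] & x[,2] < upr)

sapply(sdat, foo2)                             # proportion above the cutoff
sapply(sdat, bar2, lwr = -0.001, upr = 0.001)  # proportion in the middle band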