
I am trying to write a function that aggregates or subsets a data frame by a particular column, and then computes the proportion of values in another column of that data frame that fall within certain ranges.

Specifically, the relevant parts of my data frame, allmutations, look like this:

gennumber   sel
1          -0.00351647088810292
1           0.000728499401888683
1           0.0354633950503043
1           0.000209700229276244
2           6.42307549736376e-05
2          -0.0497259605114181
2          -0.000371856995145525

Within each generation (gennumber), I would like to compute the proportion of values in “sel” that are greater than 0.001, between -0.001 and 0.001, and less than -0.001. Over the entire data set, I've just been doing this:

ben <- allmutations$sel > 0.001        # this is for all generations
bencount <- length(which(ben == TRUE))
totalmu <- length(ben)                 # length(ben) = total number of mutants
tot.pben <- bencount/totalmu           # proportion

What is the best way to do that operation for each value in gennumber? Also, is there an easy way to get proportion of values in the range -0.001 < sel < 0.001? I couldn't figure out how to do it, so I “cheated” and took an absolute value of the column and just looked for values less than 0.001. I can't help but feel there must be a better way though.
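
For reference, the absolute-value version I used looks roughly like this (a sketch; absel and tot.pmid are just illustrative names):

absel <- abs(allmutations$sel) < 0.001   # |sel| < 0.001, i.e. -0.001 < sel < 0.001
tot.pmid <- sum(absel)/length(absel)     # proportion in the middle band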

Thanks for any help you can give, and please let me know if I can provide any clarification.

dput() of data:

structure(list(gennumber = c(1L, 1L, 1L, 1L, 2L, 2L, 2L), sel = c(-0.00351647088810292, 
0.000728499401888683, 0.0354633950503043, 0.000209700229276244, 
6.42307549736376e-05, -0.0497259605114181, -0.000371856995145525
)), .Names = c("gennumber", "sel"), class = "data.frame", row.names = c(NA, 
-7L))
LiY

2 Answers


You can combine two logical tests with &, so to test -0.001 < sel < 0.001 you can write sel > -0.001 & sel < 0.001.
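
For example, on a small vector (a quick illustration; note that & works element-wise, while && compares only single logical values, which is why it fails on a whole column):

sel <- c(-0.0035, 0.0007, 0.0355, 0.0002)
sel > -0.001 & sel < 0.001
## [1] FALSE  TRUE FALSE  TRUE
## sel > -0.001 && sel < 0.001 would not test the whole vector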

Here is a way using plyr:

dat <- read.table(tc <- textConnection("
gennumber sel
1 -0.00351647088810292
1 0.000728499401888683
1 0.0354633950503043
1 0.000209700229276244
2 6.42307549736376e-05
2 -0.0497259605114181
2 -0.000371856995145525"), header = TRUE); close(tc)

library("plyr")

ddply(dat,.(gennumber),summarize,
    `sel < -0.001` = sum(sel < -0.001)/length(sel),
    `-0.001 < sel < 0.001` = sum(sel > -0.001 & sel < 0.001)/length(sel),
    `0.001 < sel` = sum(sel > 0.001)/length(sel))
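
On the sample data this should produce something like:

  gennumber sel < -0.001 -0.001 < sel < 0.001 0.001 < sel
1         1    0.2500000            0.5000000        0.25
2         2    0.3333333            0.6666667        0.00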
Sacha Epskamp
  • Thank you for the suggestions. That was silly: I was using && instead of &. I didn't realize that R made a distinction. – LiY Apr 13 '11 at 20:30

For the first part, assuming your data are in dat, we first split the data by gennumber:

sdat <- with(dat, split(dat, gennumber))
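
sdat is now an ordinary list containing one data frame per generation, named by the values of gennumber:

> names(sdat)
[1] "1" "2"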

then we write a custom function to do the comparison you want

foo <- function(x, cutoff = 0.001) {
    ## proportion of values in the second column (sel) above the cutoff
    sum(x[,2] > cutoff) / length(x[,2])
}

and sapply() it over the individual chunks of data in sdat

sapply(sdat, foo)

Which gives:

> sapply(sdat, foo)
   1    2 
0.25 0.00

for this sample of data.

For the second part, we can extend the above function foo() to accept an upper and lower limit and do the computation:

bar <- function(x, upr, lwr) {
    ## proportion of values in the second column strictly between lwr and upr
    sum(lwr < x[,2] & x[,2] < upr) / length(x[,2])
}

Which gives (note how the extra arguments are passed through to bar() via sapply()):

> sapply(sdat, bar, lwr = -0.001, upr = 0.001)
        1         2 
0.5000000 0.6666667
Gavin Simpson
  • when x is a logical vector and you need the proportion of TRUE values, mean(x) is more elegant (but a bit slower) than sum(x)/length(x); see the sketch after these comments. – Thierry Apr 11 '11 at 20:39
  • Thanks for the help and the explanations! That worked very well. – LiY Apr 13 '11 at 20:31
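
Following up on Thierry's comment, the two helpers could be rewritten with mean(), since mean() of a logical vector is the proportion of TRUE values (a sketch; foo2 and bar2 are illustrative names, not part of the answers above):

foo2 <- function(x, cutoff = 0.001) mean(x[,2] > cutoff)
bar2 <- function(x, upr, lwr) mean(lwr < x[,2] & x[,2] < upr)

sapply(sdat, foo2)                             # proportion above the cutoff
sapply(sdat, bar2, lwr = -0.001, upr = 0.001)  # proportion in the middle band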