How to group a dataframe

Question

I have the following data:

structure(list(chr = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1), leftPos = c(720916L, 736092L, 818159L, 
4105086L, 4140849L, 4164911L, 4464314L, 4764317L, 4906564L, 5010398L, 
5690705L, 5775286L, 5867036L, 7230474L, 9719183L, 9723741L, 10142816L, 
12733035L, 12742872L, 13996088L), Means1 = c(-4.50166373984263, 
22.7414854787421, 76.5674149543741, 49.6843654244713, -6.76800756009397, 
-5.48616887633702, 171.974765532105, -2.8004419462491, 251.238878920906, 
24.3172106993831, 92.07778585458, -9.2775123433234, 110.959507183586, 
-2.57278963951353, 30.5966726830686, -6.67824575085661, 13.9003103772433, 
99.9034503108899, 11.7786055209302, 57.4114695945089), Means2 = c(-4.50166373984263,
39.4695853682122, 128.134808692772, 140.343227472869, 12.2782402432039, 
4.51728744523118, 103.786232453211, 7.2030143753191, 77.4307425703948, 
17.496276591372, 46.8586391096806, -9.2775123433234, 90.4074077805074, 
18.1392689981969, 32.2700556572673, -6.67824575085661, 17.7360779927443, 
72.2607261926, -5.09768550127924, 55.027412413907), Means3 = c(-4.50166373984263, 
46.7166260656872, 102.330672294296, 91.1594353147049, -6.76800756009397, 
7.86681297424078, 134.793892503456, -2.8004419462491, 103.286968070986, 
15.8436793365245, 59.5942459167218, -9.2775123433234, 114.600293200803, 
21.6610650365489, 19.763233485681, -6.67824575085661, 21.3471182295293, 
101.934037199673, -5.09768550127924, 72.3408810065695), Means4 = c(0, 
49.6962372223769, 132.868329968312, 151.140132674376, 19.0462478032979, 
10.0034563215682, 106.841760397079, 10.0034563215682, 88.6977192399028, 
22.252438424421, 49.7002255511708, 0, 106.856532283882, 20.7120586377105, 
34.6415203071056, 0, 20.7120586377105, 75.2943986695856, 0, 58.1026432784387
)), .Names = c("chr", "leftPos", "Means1", "Means2", "Means3", 
"Means4"), row.names = c(NA, 20L), class = "data.frame")

I would like to group by chr and leftPos, and get a mean of each of the other columns in that group.

I tried the following to do this:

CLL <- function (col) {
  col <- as.data.frame(RawZoutliers %>%
                         group_by(chr, binnum = (leftPos) %/% 500000) %>%
                         summarise(Means = mean(Means)) %>%
                         mutate(leftPos = (binnum+1) * 120000) %>%
                         select(leftPos, Means))
}

RawZoutliersBin<-lapply(RawZoutliers, CLL)

but it was a failure. The error says:

Error during wrapup: no applicable method for 'group_by_' applied to an object of class "c('double', 'numeric')"
Error during wrapup: target context is not on the stack
Error during wrapup: error in evaluating the argument 'x' in selecting a method for function 'as.data.frame': target context is not on the stack)

I'm not sure how to do this

Not a duplicate as I need to group over a certain bin size which the question you've highlighted doesnt do — Sebastian Zeki, Dec 04 '15 at 15:32
Okay. I was starting to formulate a solution, modifying your example. Unfortunately, even if I do `DF %>% mutate(binnum = (leftPos) %/% 500000) %>% group_by(chr, binnum) %>% summarise(Means = mean(Means))`, I get stuck, since you have no `Means` var. Anyway, that's what Cath meant by "create another variable" -- use `mutate` or similar. — Frank, Dec 04 '15 at 15:36

Cath · Accepted Answer · 2015-12-04T15:53:36.910

You can get the mean of every column by chr and leftPos with:

aggregate(. ~ chr + leftPos, data=mydata, FUN=mean)

In your example data, as your leftPos are all distinct, you will get your original data.frame as result.

For example, to get the mean by chr of every column, apart from chr (including leftPos):

 aggregate(. ~ chr, data=mydata, FUN=mean)
 # chr leftPos   Means1   Means2   Means3   Means4
#1   1 6372642 48.75335 41.89019 43.90577 47.82846

EDIT

If you want to do the means by "bins" of leftPos, here is a way with cut

mydata$bin <- cut(mydata$leftPos, breaks=c(seq(min(mydata$leftPos), max(mydata$leftPos), by=500000), max(mydata$leftPos)), include.lowest=T, labels=FALSE)

(you don't have to do a call to cut by chromosome as you are further subtyping your data also by chromosome).

Changing the last 4 chr to 2, you'll get:

aggregate(. ~ chr + bin, data=mydata, FUN=mean)
#   chr bin  leftPos     Means1     Means2     Means3    Means4
#1    1   1   758389  31.602412  54.367577  48.181878  60.85486
#2    1   7  4136949  12.476730  52.379585  30.752747  60.06328
#3    1   8  4464314 171.974766 103.786232 134.793893 106.84176
#4    1   9  4893760  90.918549  34.043345  38.776735  40.31787
#5    1  10  5690705  92.077786  46.858639  59.594246  49.70023
#6    1  11  5821161  50.840997  40.564948  52.661390  53.42827
#7    1  14  7230474  -2.572790  18.139269  21.661065  20.71206
#8    1  18  9719183  30.596673  32.270056  19.763233  34.64152
#9    1  19  9723741  -6.678246  -6.678246  -6.678246   0.00000
#10   2  19 10142816  13.900310  17.736078  21.347118  20.71206
#11   2  25 12737954  55.841028  33.581520  48.418176  37.64720
#12   2  27 13996088  57.411470  55.027412  72.340881  58.10264

How can I specify the size of the grouping eg group by every 500000 leftPos? Is this possible with aggregate? — Sebastian Zeki, Dec 04 '15 at 15:26
@SebastianZeki you can create another variable that you'll use for grouping — Cath, Dec 04 '15 at 15:28
can you show me how? Do I create the variable in the original dataframe? — Sebastian Zeki, Dec 04 '15 at 15:30
@SebastianZeki, e.g. (clearly not optimize): `mydata$bin <- cut(mydata$leftPos, breaks=c(seq(min(mydata$leftPos), max(mydata$leftPos), by=500000), max(mydata$leftPos)), include.lowest=T, labels=FALSE)` and then `aggregate(. ~ chr + bin, data=mydata, FUN=mean)`. — Cath, Dec 04 '15 at 15:45

How to group a dataframe

1 Answers1