please consider the following example which makes use of aggregate
twice.
library(dplyr)
set.seed(5)
x <- data.frame(
name = sample(c('NM01', 'NM02', 'NM03', 'NM04', 'NM05'), 400, replace = TRUE),
strand = sample(c('+', '-'), 400, replace = TRUE),
value = sample(6, 400, replace = TRUE)
)
x_agg_hist <- aggregate( x$value,
by = list(strand = x$strand,
transcript = x$name
),
function(v) hist( v,
breaks = seq(0.5, 6.5),
plot= FALSE
)$counts
)
y <- data.frame(
name = c('NM01', 'NM02', 'NM03', 'NM04', 'NM05'),
value = runif(5)
)
x_agg_hist$value <- y$value[match(x_agg_hist$transcript, y$name)]
x_agg_hist$division <- ifelse(x_agg_hist$value > 0.5, 1, 2) %>% as.factor()
x_agg_hist
strand transcript x.1 x.2 x.3 x.4 x.5 x.6 value division
1 - NM01 6 9 8 5 5 8 0.5661267 1
2 + NM01 4 2 8 8 8 6 0.5661267 1
3 - NM02 8 4 6 5 3 11 0.1178577 2
4 + NM02 7 6 9 8 7 7 0.1178577 2
5 - NM03 4 5 10 4 6 3 0.2572855 2
6 + NM03 6 10 5 9 5 9 0.2572855 2
7 - NM04 7 4 5 7 4 9 0.9678125 1
8 + NM04 4 3 4 10 8 9 0.9678125 1
9 - NM05 4 6 10 5 5 5 0.8891210 1
10 + NM05 11 13 5 8 12 8 0.8891210 1
So far, everything is fine. Specifically, I notice that I can select the columns of the histograms created by aggregate
"collectively" using
x_agg_hist$x
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 6 9 8 5 5 8
[2,] 4 2 8 8 8 6
[3,] 8 4 6 5 3 11
[4,] 7 6 9 8 7 7
[5,] 4 5 10 4 6 3
[6,] 6 10 5 9 5 9
Next, I would like to sum the histograms by 'division' and 'strand' (and normalise by the number of observations in each group).
x_agg_hist_agg_sum <- aggregate( x_agg_hist$x,
by = list(division = x_agg_hist$division,
strand = x_agg_hist$strand
),
function(v) sum(v)/length(v)
)
Note that using x_agg_hist$x
to select all the columns of the histograms seems a lot more convenient than what has been proposed here (Aggregate / summarize multiple variables per group (e.g. sum, mean)).
This still works as expected.
x_agg_hist_agg_sum
division strand V1 V2 V3 V4 V5 V6
1 1 - 5.666667 6.333333 7.666667 5.666667 4.666667 7.333333
2 2 - 6.000000 4.500000 8.000000 4.500000 4.500000 7.000000
3 1 + 6.333333 6.000000 5.666667 8.666667 9.333333 7.666667
4 2 + 6.500000 8.000000 7.000000 8.500000 6.000000 8.000000
However, now aggregate
has renamed the columns of the (summed) histograms in a way that does not allow selecting them collectively any more. Therefore, I was wondering if it was possible to tell aggregate to keep the original column names and structure or if there is any other method that can do so. (Of course I know that I can use x_agg_hist_agg_sum[, -c(1, 2)]
, but with my real data (after a lot of further processing) this would at least be a lot more difficult.)
Cheers,
mce1