0

please consider the following example which makes use of aggregate twice.

library(dplyr)
set.seed(5)

x <- data.frame(
                name = sample(c('NM01', 'NM02', 'NM03', 'NM04', 'NM05'), 400, replace = TRUE),
                strand = sample(c('+', '-'), 400, replace = TRUE),
                value = sample(6, 400, replace = TRUE)
                )

x_agg_hist <- aggregate(    x$value, 
                            by = list(strand = x$strand, 
                                      transcript = x$name
                                      ), 
                            function(v) hist(   v, 
                                                breaks = seq(0.5, 6.5),
                                                plot= FALSE
                                                )$counts
                            )
                                
y <- data.frame(
                name = c('NM01', 'NM02', 'NM03', 'NM04', 'NM05'),
                value = runif(5)
                )
                
x_agg_hist$value <- y$value[match(x_agg_hist$transcript, y$name)]

x_agg_hist$division <- ifelse(x_agg_hist$value > 0.5, 1, 2) %>% as.factor()


x_agg_hist
   strand transcript x.1 x.2 x.3 x.4 x.5 x.6     value division
1       -       NM01   6   9   8   5   5   8 0.5661267        1
2       +       NM01   4   2   8   8   8   6 0.5661267        1
3       -       NM02   8   4   6   5   3  11 0.1178577        2
4       +       NM02   7   6   9   8   7   7 0.1178577        2
5       -       NM03   4   5  10   4   6   3 0.2572855        2
6       +       NM03   6  10   5   9   5   9 0.2572855        2
7       -       NM04   7   4   5   7   4   9 0.9678125        1
8       +       NM04   4   3   4  10   8   9 0.9678125        1
9       -       NM05   4   6  10   5   5   5 0.8891210        1
10      +       NM05  11  13   5   8  12   8 0.8891210        1

So far, everything is fine. Specifically, I notice that I can select the columns of the histograms created by aggregate "collectively" using

x_agg_hist$x
      [,1] [,2] [,3] [,4] [,5] [,6]
 [1,]    6    9    8    5    5    8
 [2,]    4    2    8    8    8    6
 [3,]    8    4    6    5    3   11
 [4,]    7    6    9    8    7    7
 [5,]    4    5   10    4    6    3
 [6,]    6   10    5    9    5    9

Next, I would like to sum the histograms by 'division' and 'strand' (and normalise by the number of observations in each group).

x_agg_hist_agg_sum <- aggregate(    x_agg_hist$x,
                                    by = list(division = x_agg_hist$division,
                                              strand = x_agg_hist$strand
                                              ), 
                                    function(v) sum(v)/length(v)
                                    )
                                

Note that using x_agg_hist$x to select all the columns of the histograms seems a lot more convenient than what has been proposed here (Aggregate / summarize multiple variables per group (e.g. sum, mean)).

This still works as expected.

x_agg_hist_agg_sum
  division strand       V1       V2       V3       V4       V5       V6
1        1      - 5.666667 6.333333 7.666667 5.666667 4.666667 7.333333
2        2      - 6.000000 4.500000 8.000000 4.500000 4.500000 7.000000
3        1      + 6.333333 6.000000 5.666667 8.666667 9.333333 7.666667
4        2      + 6.500000 8.000000 7.000000 8.500000 6.000000 8.000000

However, now aggregate has renamed the columns of the (summed) histograms in a way that does not allow selecting them collectively any more. Therefore, I was wondering if it was possible to tell aggregate to keep the original column names and structure or if there is any other method that can do so. (Of course I know that I can use x_agg_hist_agg_sum[, -c(1, 2)], but with my real data (after a lot of further processing) this would at least be a lot more difficult.)

Cheers,

mce1

mce1
  • 13
  • 5

1 Answers1

0

I would suggest to use dplyr for such long chained operations. There are lot of benefits with it.

You can do all the transformation/manipulation and reshaping code with it in the single pipe without creating intermediate variables like x_agg_hist and x_agg_hist_agg_sum. So you don't have to remember/manage them.

The first few steps of your code code can be translated as :

library(dplyr)

x %>%
  group_by(strand, name) %>%
  summarise(res = hist(value, breaks = seq(0.5, 6.5),plot= FALSE)$counts) %>%
  left_join(y, by = 'name') %>%
  mutate(division = factor(ifelse(value > 0.5, 1, 2)))  %>%
  ungroup 

Use pivot_wider to cast the data into wide format which will maintain the names of the data.

Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • This looks very neat! However, I don't seem to be able to figure out how to sum the lists elementwise after %>% pivot_wider(names_from = strand, values_from = res) %>% group_by(division). Casting with as.data.frame does't seem to do the trick. Any suggestions? – mce1 Feb 10 '21 at 20:28
  • You might do `distinct() %>% pivot_wider(names_from = strand, values_from = res, values_fn = sum)` – Ronak Shah Feb 11 '21 at 01:37