
I have some data containing individual species and their counts in each sample. Each species is also labelled by its trophic mode. It looks like this:

OTU_ID  Trophic.Mode  Sample1  Sample2  Sample3  Sample4
gatca   Symbiotroph         4        5        6        1
atca    Pathotroph          4        3        4        4
gatac   Pathotroph          7        1        2        1

And so on; it has 2700 rows of species counts with their trophic mode labelled. I'd like to aggregate the data by the column Trophic.Mode (which has 5 levels, so I'm expecting a dataframe with 5 rows) to get a simpler functional community per sample, so that I end up with this:

Trophic.Mode  Sample1  Sample2  Sample3  Sample4
Symbiotroph         4        5        6        1
Pathotroph         13       18       15       11

I've consequently used the following code:

agg = aggregate(data,
            by = list(data$Trophic.Mode),
            FUN = sum)

This just returns an error:

Error in Summary.factor(c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L,  : 
  ‘sum’ not meaningful for factors

But I'm not sure I understand - I want to sum the other columns by those factors.

  • It is trying to sum all the columns in the `data` argument. You need to tell `aggregate` it doesn't need to sum the `OTU_ID` or `Trophic.Mode` columns - which are factors. Change `data` to `data[-(1:2)]` to omit the first two columns. – Gregor Thomas Oct 15 '20 at 14:56
  • Thanks, Gregor that worked and I'm a little embarrassed I didn't realise that – Will Oct 15 '20 at 15:10
    *"I'm a little embarrassed I didn't realise that"* Don't be. That's why there are hundreds of thousands of R questions on this site. – Gregor Thomas Oct 15 '20 at 15:23

6 Answers


I prefer the formula syntax for `aggregate`, because grouping columns on the right-hand side of the formula are automatically omitted from the summary:

aggregate(. ~ Trophic.Mode, data = data[-1], FUN = sum)

The non-formula interface with by is mostly useful if you want to group by something that's not a column in your data.
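
For example, a hypothetical sketch of grouping by a vector that is not a column of the data (the `habitat` labels here are made up purely for illustration):

# an external grouping vector with one entry per row of `data`
habitat <- rep(c("soil", "root"), length.out = nrow(data))
aggregate(data[-(1:2)], by = list(Habitat = habitat), FUN = sum)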

Gregor Thomas

I want to sum the other columns by those factors.

This is what rowsum is for:

rowsum(data[, -(1:2)], data$Trophic.Mode)
#R>             Sample1 Sample2 Sample3 Sample4
#R> Pathotroph       11       4       6       5
#R> Symbiotroph       4       5       6       1

This is hard to beat both in terms of keystrokes and computation time:

# simulate your ~2700 row data set
set.seed(1)
n <- 2700L
sim_dat <- data.frame(
  Trophic.Mode = sample.int(5, n, replace = TRUE), 
  sample       = matrix(sample.int(10, n * 4L, replace = TRUE), n))
colnames(sim_dat)[-1] <- paste0("sample", 1:4)
head(sim_dat, 3)
#R>   Trophic.Mode sample1 sample2 sample3 sample4
#R> 1            1       9       6       9       6
#R> 2            4       1       6       3       1
#R> 3            1       9       9      10       9

# check that we get the same
r1 <- aggregate(. ~ Trophic.Mode, data = sim_dat, FUN = sum)
r2 <- rowsum(sim_dat[, -1], sim_dat$Trophic.Mode)
all.equal(r1[, -1], r2, check.attributes = FALSE)
#R> [1] TRUE

library(tidyverse)
r3 <- sim_dat %>% 
  group_by(Trophic.Mode) %>%
  summarise_all(sum) %>%
  ungroup()
all.equal(r3[, -1], r2, check.attributes = FALSE)
#R> [1] TRUE

# check the computation time
bench::mark(
  aggregate = aggregate(. ~ Trophic.Mode, data = sim_dat, FUN = sum), 
  rowsum    = rowsum(sim_dat[, -1], sim_dat$Trophic.Mode), 
  tidy      = sim_dat %>% 
    group_by(Trophic.Mode) %>%
    summarise_all(sum) %>%
    ungroup(),
  min_time = 2, check = FALSE)
#R> # A tibble: 3 x 13
#R>   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time 
#R>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> 
#R> 1 aggregate    2.28ms   2.45ms      396.    1.37MB    17.4    612    27      1.55s 
#R> 2 rowsum     110.93µs  129.8µs     7449.   53.23KB    11.2   9985    15      1.34s 
#R> 3 tidy         3.49ms   4.63ms      209.   93.41KB     6.56   383    12      1.83s

That is almost 20 times faster than the other methods, but a few milliseconds is nothing to begin with...

Update

As requested, here is the benchmark with n <- 500000L (500k rows) and without the ungroup:

bench::mark(
  aggregate = aggregate(. ~ Trophic.Mode, data = sim_dat, FUN = sum), 
  rowsum    = rowsum(sim_dat[, -1], sim_dat$Trophic.Mode), 
  tidy      = sim_dat %>% 
    group_by(Trophic.Mode) %>%
    summarise_all(sum),
  min_time = 2, check = FALSE)
#R> # A tibble: 3 x 13
#R>   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time 
#R>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> 
#R> 1 aggregate   203.5ms  264.5ms      3.64  234.56MB     25.5     8    56      2.19s 
#R> 2 rowsum         11ms   11.2ms     84.8     7.82MB     14.0   170    28         2s 
#R> 3 tidy         12.8ms   13.4ms     64.8    17.36MB     25.9   130    52         2s

and for n <- 10000000L (10M):

bench::mark(
  aggregate = aggregate(. ~ Trophic.Mode, data = sim_dat, FUN = sum), 
  rowsum    = rowsum(sim_dat[, -1], sim_dat$Trophic.Mode), 
  tidy      = sim_dat %>% 
    group_by(Trophic.Mode) %>%
    summarise_all(sum),
  min_time = 30, check = FALSE)
#R>   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time 
#R>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> 
#R> 1 aggregate     5.82s    5.92s     0.168    5.05GB    2.18      6    78      35.8s 
#R> 2 rowsum     275.46ms 285.37ms     3.41   204.29MB    0.796   103    24      30.2s 
#R> 3 tidy       233.38ms 264.47ms     3.66   331.03MB    2.03    110    61      30.1s

It turns out, though, that there is very little difference for larger data sets.

  • If you're going to run benchmarks, please do it on a version of the data big enough to matter - I'd suggest at least 500k rows. And if you are going to benchmark (if performance matters at all) you should throw in `data.table` as well. – Gregor Thomas Oct 15 '20 at 15:27
  • I'd also get rid of the `ungroup()` - summarize auto-ungroups. – Gregor Thomas Oct 15 '20 at 15:28
  • The OP states that there are 2700 rows. However, I will run it with 500k in an updated version. – Benjamin Christoffersen Oct 15 '20 at 15:32
  • Yes, with 2700 rows the 20x speed-up is... 2.32 milliseconds. OP would need to run the code hundreds of thousands of times to save as much compute time as it would take to read this answer and change the code. Unless OP is working a highly tuned production system, the performance differences are irrelevant at 2700 rows. However, if the data gets big enough where performance might matter, the rankings and relative performance might change. – Gregor Thomas Oct 15 '20 at 15:38
  • Though this question was just seeking help with the `aggregate` syntax. If you do bother to run useful benchmarks I'd suggest adding as an answer to the [R-FAQ on how to sum by group](https://stackoverflow.com/q/1660124/903061) - it would be a valuable contribution there and be a lot more visible. To make a really good answer there, I'd suggest running benchmarks at 100k, 1M, and 10M rows with a large number of unique groups. – Gregor Thomas Oct 15 '20 at 15:39
  • [This fantastic answer at the mean by group FAQ](https://stackoverflow.com/a/11562850/903061) would make a good template. Your results would probably match quite well, but of course with the addition of `rowsum`. – Gregor Thomas Oct 15 '20 at 15:42
  • My initial point was that the OP states that he/she wants to _sum the other columns by those factors_. This is what `rowsum` is for. For the data set at hand it requires few keystrokes and it is fast for a data set of the size mentioned. It turns out to matter less for larger data sets, it seems. – Benjamin Christoffersen Oct 15 '20 at 15:48
  • I like the `rowsum` solution - it is the nicest (though least general) `base` way to solve this problem. My point is that *all the proposed methods are fast* for data this size - the slowest method takes a fraction of a second. New R users should focus on understanding methods and readability of code, not premature optimization. Even at 500k rows the difference just becomes noticeable. – Gregor Thomas Oct 15 '20 at 15:53
  • Yes, you really need to do this a lot of times before it starts to matter in terms of computation time. – Benjamin Christoffersen Oct 15 '20 at 16:00
  • It is interesting - the `tidy` vs `rowsum` differences are small, but your most recent update suggests that `tidy` scales better as the data gets longer. I wonder if that's also true as the number of groups increases (though if you had 10M rows and, say, >100k groups, nothing will compete with `data.table`). – Gregor Thomas Oct 15 '20 at 16:27
  • Yes, I thought that `rowsum` would keep being fast but apparently this is not the case. I also suspect that a `data.table` solution is faster. – Benjamin Christoffersen Oct 15 '20 at 18:19
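
For completeness, a rough sketch of the `data.table` approach mentioned in the comments above (not part of the benchmarks, so no timing claims are made here):

library(data.table)
dt <- as.data.table(sim_dat)
# sum every non-grouping column (.SD) within each Trophic.Mode group
r4 <- dt[, lapply(.SD, sum), by = Trophic.Mode]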
library(tidyverse)

df %>% 
  select(-OTU_ID) %>%
  group_by(Trophic.Mode) %>%
  summarise_all(sum) %>%
  ungroup()

My preference is tidyverse.

akash87
  • To help OP get started, instead of broadly recommending `tidyverse` I'd point OP directly to `dplyr` - which is all that is needed here. (And `summarise_all` is on its way out; `summarise(across(everything(), sum))` is the modern way. Which is nice because instead of manually using `select(-OTU_ID)`, we could switch to `summarise(across(where(is.numeric), sum))`.) – Gregor Thomas Oct 15 '20 at 15:03
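
A sketch of the `across()` form mentioned in the comment above, assuming the question's data is in `df`:

library(dplyr)
df %>%
  group_by(Trophic.Mode) %>%
  # sum every numeric column; character columns such as OTU_ID are dropped
  summarise(across(where(is.numeric), sum), .groups = "drop")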

Maybe try this option. It is a base R approach. As said in the comments by @GregorThomas, you have to tell the function which variables must be considered; in this case the formula omits OTU_ID. Here is the code:

#Option 1
Out1 <- aggregate(cbind(Sample1, Sample2, Sample3, Sample4) ~ Trophic.Mode,
                  data = df, FUN = sum, na.rm = TRUE)

Output:

  Trophic.Mode Sample1 Sample2 Sample3 Sample4
1   Pathotroph      11       4       6       5
2  Symbiotroph       4       5       6       1

Some data used:

#Data
df <- structure(list(OTU_ID = c("gatca", "atca", "gatac"), Trophic.Mode = c("Symbiotroph", 
"Pathotroph", "Pathotroph"), Sample1 = c(4L, 4L, 7L), Sample2 = c(5L, 
3L, 1L), Sample3 = c(6L, 4L, 2L), Sample4 = c(1L, 4L, 1L)), class = "data.frame", row.names = c(NA, 
-3L))
Duck

Try this one:

library(dplyr)
library(readr)

read_table("OTU_ID  Trophic.Mode  Sample1 Sample2 Sample3 Sample4
  gatca   Symbiotroph   4         5      6      1
  atca    Pathotroph    4         3      4      4
  gatac   Pathotroph    7         1      2      1") %>% 
  group_by(Trophic.Mode) %>% 
  summarise(across(where(is.numeric), sum))

`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 2 x 5
  Trophic.Mode Sample1 Sample2 Sample3 Sample4
  <chr>          <dbl>   <dbl>   <dbl>   <dbl>
1 Pathotroph        11       4       6       5
2 Symbiotroph        4       5       6       1
jyjek

Consider the formula version of `aggregate`, which arguably can be more readable since you explicitly see the numeric and grouping columns. With the `.` shorthand, though, you may still need to subset the data to the columns used in the formula.

Also, do note that the formula and non-formula methods of `aggregate` can differ in how they handle missing values; see the Q&A "aggregate methods treat missing values (NA) differently". @Rorschach's fix is applied below.

# EXPLICIT NUMERIC COLUMNS
agg <- aggregate(cbind(Sample1, Sample2, Sample3, Sample4) ~ Trophic.Mode,
                 data = data,
                 FUN = sum,
                 na.rm = TRUE, 
                 na.action = na.pass)

# IMPLICIT "ALL" NUMERIC COLUMNS
agg <- aggregate(. ~ Trophic.Mode,
                 data = data[, grep("Trophic|Sample", names(data))],
                 FUN = sum,
                 na.rm = TRUE,
                 na.action = na.pass)
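
To illustrate the difference, a minimal sketch with a hypothetical NA inserted into the question's data frame (named `data` here, as in this answer): the formula method's default `na.action` (`na.omit`) drops the whole row containing the NA before summing, whereas `na.action = na.pass` combined with `na.rm = TRUE` keeps the row and only ignores the missing cell.

dat_na <- data
dat_na$Sample2[1] <- NA   # hypothetical missing count

# default na.action = na.omit: the entire first row is removed before summing
aggregate(cbind(Sample1, Sample2) ~ Trophic.Mode, data = dat_na, FUN = sum)

# na.pass + na.rm = TRUE: only the NA cell is skipped; the row's other values still count
aggregate(cbind(Sample1, Sample2) ~ Trophic.Mode, data = dat_na, FUN = sum,
          na.rm = TRUE, na.action = na.pass)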
Parfait