0

I have a large dataframe in the same format as d below, with count data for three columns nested within two different factors.

elevation   distance    sp1 sp2 spn
1500    0   2   2   5
1500    0   2   1   5
1500    50  2   2   5
1500    50  2   2   6
2000    0   9   2   5
2000    0   7   2   2
2000    50  4   3   6
2000    50  4   3   4

Notice that there are two replicate rows for each level of factor d$distance.

I would like to aggregate those replicate rows for each distance level within each elevation by summing in each column, so it ends up like this:

elevation   distance    sp1 sp2 spn
1500    0   4   3   10
1500    50  4   4   11
2000    0   16  4   7
2000    50  8   6   10

I can do it easily for one column, eg sp1.
d2 <-data.frame(aggregate(sp1 ~ elevation + distance, data = d, sum))

Can I avoid a for loop to get a new dataframe in the same format that includes all columns sp, sp2, spn? Trying to adapt various other solutions I have seen online have ended in failure because certain bits of my own brain are missing. Thanks.

ptenax
  • 141
  • 1
  • 14
  • [How to make a great reproducible example in R](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610) – shayaa Jul 24 '16 at 03:25
  • @shayaa is there something I can add that will help you answer? – ptenax Jul 24 '16 at 03:29

1 Answers1

4

We can use summarise_each from dplyr. It would be fast and efficient.

library(dplyr)
df1 %>%
   group_by(elevation, distance) %>% 
   summarise_each(funs(sum))
#  elevation distance   sp1   sp2   spn
#      <int>    <int> <int> <int> <int>
#1      1500        0     4     3    10
#2      1500       50     4     4    11
#3      2000        0    16     4     7
#4      2000       50     8     6    10

Or another option is data.table

library(data.table)
setDT(df1)[, lapply(.SD, sum) , by = .(elevation, distance)]

A base R approach would be with aggregate would be to use . to specify all the columns except the ones specified in the rhs of ~. But, on a large dataset, this would be slow.

aggregate(.~elevation+distance, df1, sum)
#   elevation distance sp1 sp2 spn
#1      1500        0   4   3  10
#2      2000        0  16   4   7
#3      1500       50   4   4  11
#4      2000       50   8   6  10

NOTE: If there are NA values, use na.rm = TRUE from the sum.

As @user2100721 suggested, we can also use by from base R

by(df1[3:5], df1[1:2], FUN = colSums)

The output will be a list and it can be converted to matrix by rbinding the list elements.

akrun
  • 874,273
  • 37
  • 540
  • 662
  • thanks, very elegant solutions, and so quickly. the `base R` worked well. I had trouble getting the `dpylr` solution to work: `Error: ‘sum’ not meaningful for factors`. Very grateful. – ptenax Jul 24 '16 at 03:39
  • Because akrun's data is not the same as yours. – shayaa Jul 24 '16 at 03:42
  • @ptenax I thought your 'sp1', 'sp2', 'spn' columns are `numeric`. Can you check the `str(df1)`? You may have a `factor` class column. One way would be to convert to numeric and then do sum i.e. `as.numeric(as.character(.))` – akrun Jul 24 '16 at 03:43
  • @ptenax try `df1 %>% group_by(elevation, distance) %>% summarise_each(funs(sum(as.numeric(as.character(.)))))` – akrun Jul 24 '16 at 03:44
  • @akrun yes I did have an additional column in my real data that was a factor. Thanks everyone for comments. – ptenax Jul 24 '16 at 03:45
  • @user2100721 What do you meant by `with` and `by`? You meant `by(df1[3:5], df1[1:2], FUN = sum)` – akrun Jul 24 '16 at 03:47