Summarizing multiple columns by two variables

Question

This is my first time ever using R so please forgive me if this question is not worded properly. I have a .csv file that I imported into R and I am trying to summarize some data. each row of data if for a given year, study site, and area and each column has the number of species present. There are 4 columns for each species as there were 4 surveys where the species could have been seen.

I am trying to get the sum of each species by year and study site. Columns 5:8 are one species, 9:12 another, 13:16 another and so on. Here is the code that I thought would summarize columns 5:8 by year (YYYY) and study area (SAR):

aggregate(test[,5:8],by = list("SAR","YYYY"), FUN = sum, na.rm = TRUE)

This gives me the error message that "argument must have the same length". Can anyone help me through this initial step?

Here is some of the data:

SAR    YYYY GRID_ID WID     col1 col2 col3 col4
BCPALP  2005    1   1189    NA  NA  0   0
BCPALP  2005    1   1190    0   NA  0   0
BCPALP  2005    1   1191    0   0   NA  NA
BCPALP  2005    1   1192    0   NA  NA  NA
BCPALP  2005    1   1194    NA  NA  1   NA
BCPALP  2005    1   1195    NA  NA  1   NA
BCPALP  2005    1   1196    0   NA  0   NA
BCPALP  2005    1   1198    0   NA  0   NA
BCPALP  2005    1   1199    0   NA  0   0

I'm hoping to get an output that is something like this:

SAR    YYYY    total of columns 1:4
BCPALP 2005    2

This is the code I just tried.

aggregate(cbind("col1", "col2", "col3", "col4")~SAR+YYYY, test, FUN=sum, na.rm=TRUE, na.action=NULL)

It gives me an error message that states "variable lengths differ (found for 'SAR')".

I went back and checked the data and all the variable lengths are the same.

BTW, Can you specify `YYYY` and `SAR`s position in the dataset — akrun, Nov 13 '15 at 17:57
Sorry, I am having trouble figuring out how to add a portion of my data table to the comment. The SAR and YYYY are the headings for columns 1 and 2 respectively. — HVS, Nov 13 '15 at 18:03
@HVS My pleasure. It is difficult to get used to SO in the initial stage. — Ronak Shah, Nov 13 '15 at 18:08
@akrun I still get the error message "variable lengths differ (found for 'SAR')" — HVS, Nov 13 '15 at 18:34
@HVS Based on the example you showed, I am not getting any error. — akrun, Nov 13 '15 at 18:34
If you notice my code, I didn't quote the `"col1"`, `"col2"` etc. — akrun, Nov 13 '15 at 18:45

akrun · Accepted Answer · 2015-11-13T19:19:04.367

We can use either aggregate or data.table or dplyr. If we use the formula method for aggregate, we need to set the na.action=NULL when there are NA values in different columns. By default, the na.action=na.omit, so if there is a single NA in one of the columns, that row will be removed from the calculation.

aggregate(cbind(col1, col2, col3, col4)~SAR+YYYY, test,
                        FUN=sum, na.rm=TRUE, na.action=NULL)
#   SAR YYYY col1 col2 col3 col4
#1 BCPALP 2005    0    0    2    0

Using dplyr, we group by 'SAR', 'YYYY', and use summarise_each to get the sum of each of the 'col'.

library(dplyr)
test %>%
     group_by(SAR, YYYY) %>%
     summarise_each(funs(sum=sum(., na.rm=TRUE)), 5:ncol(test))
#     SAR  YYYY  col1  col2  col3  col4
#   (chr) (int) (int) (int) (int) (int)
#1 BCPALP  2005     0     0     2     0

Or with data.table. We convert the 'data.frame' to 'data.table' (setDT(test)), grouped by 'SAR', 'YYYY', we loop though the Subset of Data.table (.SD) and get the sum. The columns to be looped are specified in the .SDcols.

library(data.table)
setDT(test)[, lapply(.SD, sum, na.rm=TRUE), by = .(SAR, YYYY),
             .SDcols= 5:ncol(test)]  
#      SAR YYYY col1 col2 col3 col4
#1: BCPALP 2005    0    0    2    0

Update

Suppose after aggregating we need to get the row wise sum for columns 'col1:col4', then 'col5:col8' etc.

 DT <- setDT(test1)[, lapply(.SD, sum, na.rm=TRUE),
              by = .(SAR, YYYY), .SDcols= 5:ncol(test1)]
 DT1 <- melt(DT, id.var=c('SAR', 'YYYY'))[, i1 := as.numeric(gl(.N, 4, .N)),
            .(SAR, YYYY)]
 dcast(DT1, SAR+YYYY~i1, value.var='value', sum)

data

 test <- structure(list(SAR = c("BCPALP", "BCPALP",
"BCPALP", "BCPALP", 
"BCPALP", "BCPALP", "BCPALP", "BCPALP", "BCPALP"), YYYY = c(2005L, 
2005L, 2005L, 2005L, 2005L, 2005L, 2005L, 2005L, 2005L),
GRID_ID = c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), WID = c(1189L, 1190L, 1191L, 
1192L, 1194L, 1195L, 1196L, 1198L, 1199L), col1 = c(NA, 0L, 0L, 
0L, NA, NA, 0L, 0L, 0L), col2 = c(NA, NA, 0L, NA, NA, NA, NA, 
NA, NA), col3 = c(0L, 0L, NA, NA, 1L, 1L, 0L, 0L, 0L), col4 = c(0L, 
0L, NA, NA, NA, NA, NA, NA, 0L)), .Names = c("SAR", "YYYY",
"GRID_ID", 
"WID", "col1", "col2", "col3", "col4"), class = "data.frame", 
 row.names = c(NA, -9L))

set.seed(24)
m1 <- matrix(sample(c(NA,0:5), 9*4, replace=TRUE),ncol=4, 
           dimnames=list(NULL, paste0('col', 5:8)))
test1 <- cbind(test, m1)

Also, how can I tell R to summarize columns 5:8, 9:12,13:16 and so on until the end of my data? — HVS, Nov 13 '15 at 18:35
@HVS As I have showed, you can specify the column position in `select` with `dplyr` or `.SDcols` in `data.table`. i.e. `.SDcols= 5:ncol(test)`. — akrun, Nov 13 '15 at 18:41
When I copy and paste the data you posted above into R it runs and then I don't get an error message with the aggregate function, The error message must be due to the way the column was created in excel and then converted to a .csv file. I am still a bit unclear on the .SDcols= 5:ncol(test) command. I don't understand how that statement tells R to read the columns I mentioned above. — HVS, Nov 13 '15 at 18:52
@HVS Actually, you got the error with `aggregate` by quoting the `"col1"` etc. We are specifying the position or names of the columns in `.SDcols` and based on that, we loop though those columns (`lapply(.SD, ...`) — akrun, Nov 13 '15 at 18:53
@HVS After aggregating by groups, do you need to take the `sum` rowwise for 5:8, 9:12, etc.. — akrun, Nov 13 '15 at 19:00
@HVS Please check the update. I created a slightly better example to do the rowwise sum. — akrun, Nov 13 '15 at 19:24

Summarizing multiple columns by two variables

1 Answers1

Update

data

Linked