21

I have a large data table (from the package data.table) with over 60 columns (the first three corresponding to factors and the remaining to response variables, in this case different species) and several rows corresponding to the different levels of the treatments and the species abundances. A very small version looks like this:

library(data.table)
TEST <- data.table(Time=c("0","0","0","7","7","7","12"),
             Zone=c("1","1","0","1","0","0","1"),
             quadrat=c(1,2,3,1,2,3,1),
             Sp1=c(0,4,29,9,1,2,10),
             Sp2=c(20,17,11,15,32,15,10),
             Sp3=c(1,0,1,1,1,1,0))

setkey(TEST,Time)
TEST

#    Time Zone quadrat Sp1 Sp2 Sp3
# 1:    0    1       1   0  20   1
# 2:    0    1       2   4  17   0
# 3:    0    0       3  29  11   1
# 4:   12    1       1  10  10   0
# 5:    7    1       1   9  15   1
# 6:    7    0       2   1  32   1
# 7:    7    0       3   2  15   1

I first want to calculate the mean abundances of each species across Time for each Zone x quadrat combination and that's fine:

Abundance = TEST[ , lapply(.SD, mean), by = "Zone,quadrat"]
Abundance
#    Zone quadrat Time       Sp1  Sp2       Sp3
# 1:   Z1       1   NA  6.333333 15.0 0.6666667
# 2:   Z1       2   NA  2.500000 24.5 0.5000000
# 3:   Z0       1   NA 15.500000 13.0 1.0000000  

Then I want to calculate rowwise sum for the 'species' columns, in the example from Sp1 to Sp3. I have tried the following code with no success:

Abundance$SumAbundance <- rowSums(Abundance[ , c(4:6)])  

I get the error message:

# Error in rowSums(Abundance[, c(4:6)]) : 
# 'x' must be an array of at least two dimensions

How can I compute row sums for specific columns of a data.table?

Henrik
  • 65,555
  • 14
  • 143
  • 159
Claire G
  • 313
  • 1
  • 2
  • 7
  • `rowSums` would now work as `Abundance[ , 4:6]` no longer needs `with=FALSE` -- `4:6` in `j` will do the column subsetting as expected – MichaelChirico Feb 15 '20 at 04:42

2 Answers2

36

[ Edited 2020-02-15 to reflect current state of data.table ] In recent versions of data.table rowSums(Abundance[ , 4:6]) works as OP originally expected. Here are some alternatives:

Abundance[, SumAbundance := rowSums(.SD), .SDcols = 4:6]

Also, I didn't check, but I have a suspicion this will be faster, since it will not convert to matrix as rowSums does:

Abundance[, SumAbundance := Reduce(`+`, .SD), .SDcol = 4:6]
MichaelChirico
  • 33,841
  • 14
  • 113
  • 198
eddi
  • 49,088
  • 6
  • 104
  • 155
  • Hi @eddi, that's great it works. Could you explain why the original version (Abundance[, c(4:6)]) did that? – Claire G Feb 18 '14 at 16:21
  • 1
    @ClaireG see point `1.1` in the [FAQ](http://datatable.r-forge.r-project.org/datatable-faq.pdf) – eddi Feb 18 '14 at 16:22
  • @eddi, how to work around if some of the values are NA and you want to get rid of them in the reduce version – Gaurav Singhal Mar 27 '17 at 13:04
  • @GauravSinghal you can, but it's not going to be pretty or fast and I'd just use `rowSums` in that case – eddi Mar 28 '17 at 21:41
  • @eddi, I found an answer here by arun http://stackoverflow.com/a/29270493/4613606, I don't know if its slow, but not difficult to understand for a naive like me – Gaurav Singhal Mar 29 '17 at 09:04
  • 2
    @GauravSinghal simply do `rowSums(..., na.rm = T)`. I don't think that the `Reduce` solution offers any advantages when there are NA's. Without NA's it can be faster than `rowSums`, and is roughly as compact. – eddi Mar 29 '17 at 17:50
  • Thanks for that, so trade off is either converting the data.table to matrix V/s to check all values in each column for NA and replace them by 0. I think the exact times will depend on dimensions of data.table. My guess is: if number of columns is very high then rowSums could be faster, or if number of rows are very high then Reduce could be faster. Not sure though. – Gaurav Singhal Mar 30 '17 at 08:12
8

An alternative (data.table) approach would be to store your data in long form. Version 1.8.11 of data.table has fast melt and dcast methods

library(reshape2)
mt <- melt(test, id=1:3,variable.name='Species')

abundance <- mt[,list(abundance = mean(value)),by=list(Zone,quadrat,Species)][, 
                sumAbundance := sum(abundance), by = list(Zone,quadrat)]

Working in long format will take a slight change in thinking, but it may end up being more efficient memory wise (as less internal copying will be involved, and you are referencing a single not multiple elements within every "by" group.)

mnel
  • 113,303
  • 27
  • 265
  • 254
  • I'm afraid this does not work and I'm not sure it's the best way anyways as I have 63 species (variables) in my actual dataset... – Claire G Feb 19 '14 at 15:30
  • ok, after struggling to update to version 1.8.11 of data.table (sorry...), the commands do work but it's not what I want to do: I want to sum the species means, so the commands suggested above by @eddi are the way for me. – Claire G Feb 19 '14 at 16:27
  • @ClaireG --I've altered my example. Working in "long" form will take a bit of getting used to. – mnel Feb 19 '14 at 22:26