7

Was looking for a easy way to get the sum total of around 170 different columns in my data table as a new column. Is there some wild card notation I can use?

This following is a small extract (first few columns) from my dataset:

> head(t_checkin)
checkin_info_0.0 checkin_info_0.1 checkin_info_0.2 checkin_info_0.3 checkin_info_0.4 checkin_info_0.5
              NA               NA               NA               NA               NA                1
               3               NA               NA               NA               NA                1
              NA               NA                1               NA               NA               NA
              NA               NA               NA               NA               NA               NA
              NA               NA               NA               NA               NA               NA
              NA               NA               NA               NA               NA               NA

Any help would be greatly appreciated. Thanks

Abi K
  • 631
  • 2
  • 7
  • 13
  • try `?colSums` function – Nishanth Apr 09 '13 at 14:53
  • look into `na.rm` argument - depending on how you to handle missing values – Nishanth Apr 09 '13 at 14:54
  • Thanks for the info. I want to omit the NA values, therefore I guess I can use something like colSums(t_checkin, na.rm="False") but I have another column in my dataset which is string. How can I omit that first column as well? – Abi K Apr 09 '13 at 15:04
  • @oostopitre Grouping columns are already excluded from `.SD` but to exclude others use `.SDcols`. – Matt Dowle Apr 09 '13 at 15:17

1 Answers1

9

EDIT

Correction, do what @MatthewDowle says:

dat <- data.frame(x=1:11,y=100:110,z=sample(letters,11))
DT <- as.data.table(dat)
names <- c("x","y")
DT[,lapply(.SD,sum),.SDcols=names]


mysum <- function(x){sum(x, na.rm=TRUE)}
DT[,lapply(.SD,mysum),.SDcols=names]
user1317221_G
  • 15,087
  • 3
  • 52
  • 78
  • `colsums` isn't the best way in `data.table`. There is an example of `lapply(.SD,...)` in `?data.table`, which is the intended way. Often used with `.SDcols` parameter. – Matt Dowle Apr 09 '13 at 15:09
  • So I ended up getting what I want using the following syntax by @user1317221_G: `t_checkin$totCheckin<-rowSums(t_checkin[,2:169],na.rm=TRUE)`. I needed the rowSums. I will checkout the lapply function. – Abi K Apr 09 '13 at 15:31
  • I may have missed something then if `rowSums` is needed. Aside: instead of `mysum` could it be `DT[,lapply(.SD,sum,na.rm=TRUE),.SDcols=names]`? – Matt Dowle Apr 09 '13 at 15:52
  • One final question. So if I do the lapply as suggested above it is giving me the colSum for each column, I need the rowSums. Can lapply still be used? How do I switch it to rowSums. – Abi K Apr 09 '13 at 15:58
  • 1
    @oostopitre If you need the row sums then that wasn't at all clear from the question which says "sum of columns". `lapply(.SD` is just for column sums. Row sums is quite different animal from a memory and efficiency point of view; `data.table` doesn't offer anything better than `rowSums` for that, currently. – Matt Dowle Apr 09 '13 at 16:05
  • 2
    Thank you for that info. That is very helpful information actually. Sorry for the confusing title. Changed the title. – Abi K Apr 09 '13 at 16:07
  • @oostopitre is the answer to your question that you should use rowsums? These comments and the answer seem to imply two different things. If that's the case will you answer your own question to mark the correct answer. – Steve Bronder May 31 '16 at 15:55