Using `:=` in data.table to sum the values of two columns in R, ignoring NAs

Question

I have what I think is a very simple question related to the use of data.table and the := function. I don't think I quite understand the behaviour of := and often I run into similar problems.

Here is some example data

 mat <- structure(list(
              col1 = c(NA, 0, -0.015038, 0.003817, -0.011407), 
              col2 = c(0.003745, 0.007463, -0.007407, -0.003731, -0.007491)), 
              .Names = c("col1", "col2"), 
              row.names = c(NA, -5L), 
              class = c("data.table", "data.frame"))

which gives

> mat
         col1      col2
 1:        NA  0.003745
 2:  0.000000  0.007463
 3: -0.015038 -0.007407
 4:  0.003817 -0.003731
 5: -0.011407 -0.007491

I want to create a column called col3 which gives the sum of col1 and col2. If I use

mat[,col3 := col1 + col2]

#        col1      col2      col3
#1:        NA  0.003745        NA
#2:  0.000000  0.007463  0.007463
#3: -0.015038 -0.007407 -0.022445
#4:  0.003817 -0.003731  0.000086
#5: -0.011407 -0.007491 -0.018898

then I get an NA for the first row, but I want NAs to be ignored. So I tried instead

mat[,col3 := sum(col1,col2,na.rm=TRUE)]

#        col1      col2      col3
#1:        NA  0.003745 -0.030049
#2:  0.000000  0.007463 -0.030049
#3: -0.015038 -0.007407 -0.030049
#4:  0.003817 -0.003731 -0.030049
#5: -0.011407 -0.007491 -0.030049

which is not what I am after, since it is giving me the sum of all elements of col1 and col2. I think I don't quite get :=... How can I get the row-by-row sum of col1 and col2, ignoring NA values?

Not sure this is relevant, but here is my sessionInfo

> sessionInfo()
R version 2.15.1 (2012-06-22)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.8.3

Perhaps it is because there is no key to sum by. – A5C1D2H2I1M1N2O1R2T1 Oct 28 '12 at 05:43
But I don't want to sum by key, I want to sum by row!? – Vivi Oct 28 '12 at 06:00
`rowSums` with `na.rm=TRUE` – Joshua Ulrich Oct 28 '12 at 06:49

mnel · Answer 1 · 2012-10-28T23:25:33.820

29

This is standard R behaviour; it has nothing really to do with data.table.

Adding anything to NA returns NA:

NA + 1
## NA

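`sum` returns a single number: it adds up every value in all of its arguments, so here it gives one total for both columns rather than one value per row.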
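A minimal base-R sketch of that difference, using two made-up vectors `x` and `y` (illustrative only, not the question's data): `+` works element by element, while `sum` pools everything into one value. A length-1 result is what `:=` then repeats down the whole column.

```r
# Hypothetical vectors for illustration only
x <- c(NA, 1, 2)
y <- c(10, 20, 30)

# `+` is vectorized: one result per element, and NA propagates
elementwise <- x + y

# sum() pools all values of all its arguments into a single number
total <- sum(x, y, na.rm = TRUE)

length(elementwise)  # 3
length(total)        # 1
```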

If you want 1 + NA to return 1, then you will have to run something like

mat[,col3 := col1 + col2]
mat[is.na(col1), col3 := col2]
mat[is.na(col2), col3 := col1]

This handles the rows where col1 or col2 is NA.
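As a rough check of the three-step approach, here is a sketch on a small made-up table (`dt`, `a`, `b` and `total` are hypothetical names) that also includes a row where both inputs are missing. Under this approach that row should stay NA, because both fallbacks copy an NA.

```r
library(data.table)

# Hypothetical table; the last row has NA in both columns
dt <- data.table(a = c(NA, 1, 2, NA), b = c(10, NA, 30, NA))

dt[, total := a + b]       # NA wherever either input is NA
dt[is.na(a), total := b]   # fall back to b when a is missing
dt[is.na(b), total := a]   # fall back to a when b is missing

dt$total
```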

EDIT - an easier solution

You could also use rowSums, which has a na.rm argument

mat[ , col3 := rowSums(.SD, na.rm = TRUE), .SDcols = c("col1", "col2")]

rowSums is what you want: by definition, it gives the row sums of a matrix containing col1 and col2, removing NA values.

(@JoshuaUlrich suggested this in a comment.)
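One detail worth checking before relying on this: with `na.rm = TRUE`, `rowSums` returns 0 (not NA) for a row in which every value is missing. The sketch below uses a made-up matrix `m` and shows one possible way to restore NA for such rows, if that is the behaviour you want.

```r
# Hypothetical matrix; the last row is entirely NA
m <- cbind(a = c(NA, 1, NA), b = c(10, 20, NA))

s <- rowSums(m, na.rm = TRUE)   # the all-NA row sums to 0, not NA

# Optional: put NA back where every value in the row was missing
all_missing <- rowSums(!is.na(m)) == 0
s[all_missing] <- NA

s
```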

edited Oct 28 '12 at 23:25

answered Oct 28 '12 at 06:02


Ulrich's comment (your edit) seems like what I am after. I can't test now, but should be able to tomorrow. – Vivi Oct 28 '12 at 08:16
`rowSums` is by far the fastest option since it's vectorized – isthisthat Jan 18 '23 at 14:55

IRTFM · Accepted Answer · 2012-11-07T19:14:02.737

22

It's not a lack of understanding of data.table but rather one regarding vectorized functions in R. You can define a dyadic operator that will behave differently than the "+" operator with regard to missing values:

 `%+na%` <- function(x,y) {ifelse( is.na(x), y, ifelse( is.na(y), x, x+y) )}

 mat[ , col3:= col1 %+na% col2]
#-------------------------------
        col1      col2      col3
1:        NA  0.003745  0.003745
2:  0.000000  0.007463  0.007463
3: -0.015038 -0.007407 -0.022445
4:  0.003817 -0.003731  0.000086
5: -0.011407 -0.007491 -0.018898

You can use mrdwad's comment to do it with sum(..., na.rm=TRUE), grouping by row number so that each group is a single row:

mat[ , col4 := sum(col1, col2, na.rm=TRUE), by=1:NROW(mat)]
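To see that the two approaches agree, here is a small sketch on a made-up table (`dt`, `via_op` and `via_by` are hypothetical names). The operator is redefined so the block runs on its own, and `seq_len(nrow(dt))` is used in place of `1:NROW(mat)`; both give one group per row.

```r
library(data.table)

# Same NA-aware operator as above, repeated so the sketch is self-contained
`%+na%` <- function(x, y) ifelse(is.na(x), y, ifelse(is.na(y), x, x + y))

# Hypothetical table
dt <- data.table(a = c(NA, 1, 2), b = c(10, 20, NA))

dt[, via_op := a %+na% b]                                       # vectorized
dt[, via_by := sum(a, b, na.rm = TRUE), by = seq_len(nrow(dt))] # one group per row

dt
```

The two differ when both inputs are NA: `%+na%` returns NA, whereas `sum(..., na.rm = TRUE)` returns 0. Grouping by row also calls `sum` once per row, so it is likely to be slower on large tables.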

edited Nov 07 '12 at 19:14

answered Oct 28 '12 at 06:03


I thought I could do something like this, but I really believed there would be a pre-programmed function or way to do this that didn't involve writing my own function... I also thought := should behave by row and perhaps there would be a way to make sum() work (perhaps by using something like with=FALSE). – Vivi Oct 28 '12 at 06:27
It is sensible default behavior. NA rarely implies 0 as you would have it. More sensible to inherit the NA information than assume a value for it. – mnel Oct 28 '12 at 06:31
This has absolutely nothing to do with data.table or :=. – mnel Oct 28 '12 at 06:33
@mnel but why does := sum(col1,col2) sum the entire columns and not just the values of col1 and col2 in that particular row? This is why I included the reference to := and data.table – Vivi Oct 28 '12 at 06:42
That is your misunderstanding of what := does. It assigns by reference within mat, so without a lot of internal copying. Nothing to do with references to rows in the data table. – mnel Oct 28 '12 at 06:45
@Vivi It's not a bad point. For `min` and `max` there is `pmin` and `pmax`, so for `sum` why is there no `psum`? Basically you're looking for `psum`. I might ask that myself! ... – Matt Dowle Oct 29 '12 at 14:09
@Vivi Now asked here : http://stackoverflow.com/questions/13123638/there-is-pmin-and-pmax-each-taking-na-rm-why-no-psum – Matt Dowle Oct 29 '12 at 14:38
Yes. BenBolker has remembered rowSums. – IRTFM Oct 29 '12 at 15:01
well, I didn't remember it -- I looked it up here. I don't know if I should get any points for it, but I *did* answer the question (the "why doesn't this exist already?" part is basically unanswerable, I think ...) – Ben Bolker Oct 29 '12 at 15:04
@DWin this by=1:NROW(Mat) is very useful! I was looking for something like this for a while, no idea it existed! – Vivi Oct 29 '12 at 18:29
I like the by=1:NROW solution, but it seems a lot slower than the %+na% alternative... – Vivi Nov 06 '12 at 22:51
