I'd just like to understand a (for me) weird behavior of the function rowSums. Imagine I have this super simple data frame:

a = c(NA, NA,3)
b = c(2,NA,2)
df = data.frame(a,b)
df
   a  b
1 NA  2
2 NA NA
3  3  2

Now I want a third column that is the sum of the other two. I cannot simply use + because of the NAs:

df$c <- df$a + df$b
df
   a  b  c
1 NA  2 NA
2 NA NA NA
3  3  2  5

But if I use rowSums, rows that contain only NAs come out as 0, while rows with just one NA work fine:

df$d <- rowSums(df, na.rm=T)
df
   a  b  c  d
1 NA  2 NA  2
2 NA NA NA  0
3  3  2  5 10

Am I missing something?

Thanks to all

matteo

2 Answers

Because

sum(numeric(0))
# 0

With na.rm = TRUE, rowSums drops the NAs before summing, so the second row is effectively numeric(0), and the sum of an empty vector is 0.
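As a minimal sketch of that reasoning, the snippet below reproduces what happens to the second row in isolation. The names `row2` and `kept` are only illustrative and do not appear in the question.

```r
# Values of row 2 in the question's data frame (both missing).
row2 <- c(NA_real_, NA_real_)

# What is left once the NAs are removed: an empty numeric vector.
kept <- row2[!is.na(row2)]
length(kept)                  # 0

# The sum of an empty numeric vector is defined as 0 ...
sum(kept)                     # 0

# ... so both sum() and rowSums() report 0 for an all-NA row.
sum_row2 <- sum(row2, na.rm = TRUE)
rs_row2  <- rowSums(matrix(row2, nrow = 1), na.rm = TRUE)
```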

If you want to keep NA for rows that are entirely NA, it takes two steps. I recommend writing a small function for this purpose:

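my_rowSums <- function(x) {
  # convert once up front so rowSums does not have to
  if (is.data.frame(x)) x <- as.matrix(x)
  # row sums ignoring NAs (all-NA rows come out as 0)
  z <- base::rowSums(x, na.rm = TRUE)
  # rows with no non-NA value get NA instead of 0
  z[!base::rowSums(!is.na(x))] <- NA
  z
}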

my_rowSums(df)
# [1]  2 NA 10

This is particularly useful if the input x is a data frame (as in your case). base::rowSums first checks whether its input is a data frame and, if so, converts it to a matrix. That type conversion is in fact more costly than the row sum computation itself. Since we work on x twice, converting it to a matrix once beforehand keeps the conversion overhead down.
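A rough way to check the cost claim on your own machine is sketched below. The data set `big_df` and its size are made up for illustration, and the timings will vary between machines and R versions, so no numbers are shown.

```r
# Hypothetical test data: 100,000 rows x 20 numeric columns.
set.seed(1)
big_df <- as.data.frame(matrix(rnorm(2e6), ncol = 20))

# Time the conversion alone, the row sums on a matrix,
# and rowSums() applied directly to the data frame.
t_convert <- system.time(big_mat <- as.matrix(big_df))
t_sum     <- system.time(s_mat <- rowSums(big_mat))
t_df      <- system.time(s_df <- rowSums(big_df))

# Elapsed seconds for each step.
print(c(convert = unname(t_convert["elapsed"]),
        sum     = unname(t_sum["elapsed"]),
        df      = unname(t_df["elapsed"])))
```

Both paths give the same sums; only the time spent on conversion differs.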

For @akrun's "hacking" answer, I suggest:

akrun_rowSums <- function(x) {
  if (is.data.frame(x)) x <- as.matrix(x)
  rowSums(x, na.rm = TRUE) * NA^!rowSums(!is.na(x))
}

akrun_rowSums(df)
# [1]  2 NA 10
Zheyuan Li
  • mm ok.. But what if I want to keep NA also in the third column? – matteo Jul 23 '16 at 17:10
  • This will probably be a 2 step process. For example, `df$new <- rowSums(df, na.rm=T); is.na(df$new) <- rowSums(is.na(df)) == length(df)` – lmo Jul 23 '16 at 17:21

One option is to compute rowSums with na.rm=TRUE and multiply the result by a vector that is NA for rows containing only NAs and 1 otherwise. rowSums(!is.na(df)) counts the non-NA values in each row; negating that count (!) gives TRUE exactly for the all-NA rows, and NA^ then turns TRUE into NA (NA^1) and FALSE into 1 (NA^0):

rowSums(df, na.rm=TRUE) *NA^!rowSums(!is.na(df))
#[1]  2 NA 10
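To make the one-liner easier to follow, here is a step-by-step sketch of the same expression. It assumes a fresh data frame with only the columns a and b (without the c column added in the question), so row 3 sums to 5 rather than 10. The intermediate names `n_present`, `all_na`, `mask` and `total` are only illustrative.

```r
df <- data.frame(a = c(NA, NA, 3), b = c(2, NA, 2))

# Number of non-NA values in each row: 1 0 2
n_present <- rowSums(!is.na(df))

# TRUE only where that count is 0, i.e. the row is all NA.
all_na <- !n_present

# NA^TRUE is NA (NA^1) and NA^FALSE is 1 (NA^0),
# so the mask is 1 NA 1.
mask <- NA^all_na

# Multiplying keeps ordinary sums and turns the all-NA row into NA.
total <- rowSums(df, na.rm = TRUE) * mask
df$c <- total
```

With this data, `total` holds 2, NA and 5, which is the third column the question asks for.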
akrun