ignore NA in dplyr row sum

Question

is there an elegant way to handle NA as 0 (na.rm = TRUE) in dplyr?

data <- data.frame(a=c(1,2,3,4), b=c(4,NA,5,6), c=c(7,8,9,NA))

data %>% mutate(sum = a + b + c)

a  b  c sum
1  4  7  12
2 NA  8  NA
3  5  9  17
4  6 NA  NA

but I like to get

a  b  c sum
1  4  7  12
2 NA  8  10
3  5  9  17
4  6 NA  10

even if I know that this is not the desired result in many other cases

score 76 · Accepted Answer · answered Nov 19 '15 at 14:31

76

You could use this:

library(dplyr)
data %>% 
  #rowwise will make sure the sum operation will occur on each row
  rowwise() %>% 
  #then a simple sum(..., na.rm=TRUE) is enough to result in what you need
  mutate(sum = sum(a,b,c, na.rm=TRUE))

Output:

Source: local data frame [4 x 4]
Groups: <by row>

      a     b     c   sum
  (dbl) (dbl) (dbl) (dbl)
1     1     4     7    12
2     2    NA     8    10
3     3     5     9    17
4     4     6    NA    10

answered Nov 19 '15 at 14:31

LyzandeR

37,047
12
77
87

You are very welcome @ckluss . I provided the most "dplyr -ic" way (if I can say this, in the sense that it is using dplyr in a traditional way as per the tutorials) of doing it. However, using other base functions (alone or in conjunction with dplyr) is definitely more efficient than mine. StevenBeaupre 's and Akrun 's answers are more efficient so you would probably be better off with those if speed is important to you. – LyzandeR Nov 19 '15 at 15:16
@LyzandeR I guess the OP wanted the `dplyr`ish way. So, don't worry about the efficiency. – akrun Nov 19 '15 at 15:24
1

Thanks a lot @akrun . It is very kind of you to say this. I really liked your answer too (have upvoted), btw, and it is a `dplyr` ish way too (using mutate_each and mutate). Very good :) – LyzandeR Nov 19 '15 at 15:27
Note: sum will be 0 if all columns are `NA`. Use `hablar::sum_()` instead of `sum()` if you want sum to be `NA` in this case. See also [this post](https://stackoverflow.com/a/56473131/6152316) – retodomax Jan 04 '23 at 15:14

Steven Beaupré · Answer 2 · 2015-11-19T16:13:09.683

Another option:

data %>%
  mutate(sum = rowSums(., na.rm = TRUE))

Benchmark

library(microbenchmark)
mbm <- microbenchmark(
steven = data %>% mutate(sum = rowSums(., na.rm = TRUE)), 
lyz    = data %>% rowwise() %>% mutate(sum = sum(a, b, c, na.rm=TRUE)),
nar    = apply(data, 1, sum, na.rm = TRUE),
akrun  = data %>% mutate_each(funs(replace(., which(is.na(.)), 0))) %>% mutate(sum=a+b+c),
frank  = data %>% mutate(sum = Reduce(function(x,y) x + replace(y, is.na(y), 0), ., 
                                     init=rep(0, n()))),
times = 10)

#Unit: milliseconds
#   expr         min          lq       mean     median         uq        max neval cld
# steven    9.493812    9.558736   18.31476   10.10280   22.55230   65.15325    10 a  
#    lyz 6791.690570 6836.243782 6978.29684 6915.16098 7138.67733 7321.61117    10   c
#    nar  702.537055  723.256808  799.79996  805.71028  849.43815  909.36413    10  b 
#  akrun   11.372550   11.388473   28.49560   11.44698   20.21214  155.23165    10 a  
#  frank   20.206747   20.695986   32.69899   21.12998   25.11939  118.14779    10 a

Maybe you could add akrun's? I see it at 1.33 vs your 1.00 in unit = "relative" (using this variation: `data %>% mutate(sum = Reduce(function(x,y) x+replace(y,is.na(y),0), ., init=rep(0,n())))`) — Frank, Nov 19 '15 at 15:18
In dplyr >= 1.1.0 you can wrap `pick` in `rowSums` to [tidy-select](https://tidyselect.r-lib.org/reference/language.html) columns: `rowSums(pick(where(is.numeric)), na.rm = T)` — LMc, Aug 23 '23 at 17:12

akrun · Answer 3 · 2015-11-19T15:16:55.747

7

Or we can replace NA with 0 and then use the OP's code

data %>% 
   mutate_each(funs(replace(., which(is.na(.)), 0))) %>%
   mutate(Sum= a+b+c)
   #or as @Frank mentioned
   #mutate(Sum = Reduce(`+`, .))

Based on the benchmarks using @Steven Beaupré data, it seems to be efficient as well.

edited Nov 19 '15 at 15:16

answered Nov 19 '15 at 15:05

akrun

874,273
37
540
662

wibeasley · Answer 4 · 2021-08-20T19:48:56.747

Here's a similar approach to Steven's, but includes dplyr::select() to explicitly state which columns to include/ignore (like ID variables).

data %>% 
  mutate(sum = rowSums(dplyr::select(., a, b, c), na.rm = TRUE))

# Here's a comparable version that uses R's new native pipe.
data |> 
  {\(x)
    mutate(
      x, 
      sum = rowSums(dplyr::select(x, a, b, c), na.rm = TRUE)
    ) 
  }()

It has comparable performance with a realistically-sized dataset. I'm not sure why though, since no columns are actually being excluded in this skinny example.

Bigger dataset of 1M rows:

pick <- function() { sample(c(1:5, NA), 1000000, replace=T) }
data <- data.frame(a=pick(), b=pick(), c=pick())

Results:

Unit: milliseconds
     expr         min          lq        mean      median          uq         max neval cld
   steven    22.05847    22.96164    56.84822    28.85411    54.99691   174.58447    10 a  
wibeasley    25.10274    26.98303    30.66911    29.30630    30.63343    49.46048    10 a  
      lyz 10408.89904 10548.33756 10887.51930 10720.92372 11017.56256 12250.41370    10   c
      nar  1975.35941  2011.36445  2123.81705  2090.43174  2172.80501  2362.13658    10  b 
    akrun    31.27247    35.41943    81.33320    57.93900    63.59119   302.21059    10 a  
    frank    37.48265    38.72270    65.02965    41.62735    44.45775   261.79898    10 a

score 4 · Answer 5 · answered Jul 12 '20 at 07:01

4

With new dplyr 1.0.0 you can use c_across with rowwise.

library(dplyr)

data %>%
  rowwise() %>%
  mutate(sum = sum(c_across(a:c), na.rm = TRUE))

#      a     b     c   sum
#  <dbl> <dbl> <dbl> <dbl>
#1     1     4     7    12
#2     2    NA     8    10
#3     3     5     9    17
#4     4     6    NA    10

answered Jul 12 '20 at 07:01

Ronak Shah

377,200
20
156
213

1

I like the conciseness of this syntax. But for moderately-sized datasets (say 100k rows), the `rowwise()` function is thousands of times slower. Where the fast four versions in my post take <100 ms, this one takes 146,000 ms. – wibeasley Aug 20 '21 at 19:59

score 2 · Answer 6 · answered Nov 19 '15 at 14:29

2

Try this

data$sum <- apply(data, 1, sum, na.rm = T)

Resulting data is

a  b  c sum
1 1  4  7  12
2 2 NA  8  10
3 3  5  9  17
4 4  6 NA  10

answered Nov 19 '15 at 14:29

narendra-choudhary

4,582
4
38
58

ignore NA in dplyr row sum

6 Answers6

Linked

Related