38

It seems that R might be missing an obvious simple function: psum. Does it exist as a different name, or is it in a package somewhere?

x = c(1,3,NA,5)
y = c(2,NA,4,1)

min(x,y,na.rm=TRUE)    # ok
[1] 1
max(x,y,na.rm=TRUE)    # ok
[1] 5
sum(x,y,na.rm=TRUE)    # ok
[1] 16

pmin(x,y,na.rm=TRUE)   # ok
[1] 1 3 4 1
pmax(x,y,na.rm=TRUE)   # ok
[1] 2 3 4 5
psum(x,y,na.rm=TRUE)
[1] 3 3 4 6                             # expected result
Error: could not find function "psum"   # actual result

I realise that + is already like psum, but what about NA?

x+y                      
[1]  3 NA NA  6        # can't supply `na.rm=TRUE` to `+`

Is there a case to add psum? Or have I missed something.

This question is a follow up from this question :
Using := in data.table to sum the values of two columns in R, ignoring NAs

Community
  • 1
  • 1
Matt Dowle
  • 58,872
  • 22
  • 166
  • 224

3 Answers3

21

Following @JoshUlrich's comment on the previous question,

psum <- function(...,na.rm=FALSE) { 
    rowSums(do.call(cbind,list(...)),na.rm=na.rm) } 

edit: from Sven Hohenstein:

psum2 <- function(...,na.rm=FALSE) { 
    dat <- do.call(cbind,list(...))
    res <- rowSums(dat, na.rm=na.rm) 
    idx_na <- !rowSums(!is.na(dat))
    res[idx_na] <- NA
    res 
}

x = c(1,3,NA,5,NA)
y = c(2,NA,4,1,NA)
z = c(1,2,3,4,NA)

psum(x,y,na.rm=TRUE)
## [1] 3 3 4 6 0
psum2(x,y,na.rm=TRUE)
## [1] 3 3 4 6 NA

n = 1e7
x = sample(c(1:10,NA),n,replace=TRUE)
y = sample(c(1:10,NA),n,replace=TRUE)
z = sample(c(1:10,NA),n,replace=TRUE)

library(rbenchmark)
benchmark(psum(x,y,z,na.rm=TRUE),
          psum2(x,y,z,na.rm=TRUE),
          pmin(x,y,z,na.rm=TRUE), 
          pmax(x,y,z,na.rm=TRUE), replications=20)

##                          test replications elapsed relative 
## 4  pmax(x, y, z, na.rm = TRUE)           20  26.114    1.019 
## 3  pmin(x, y, z, na.rm = TRUE)           20  25.632    1.000 
## 2 psum2(x, y, z, na.rm = TRUE)           20 164.476    6.417
## 1  psum(x, y, z, na.rm = TRUE)           20  63.719    2.486

Sven's version (which arguably is the correct one) is quite a bit slower, although whether it matters obviously depends on the application. Anyone want to hack up an inline/Rcpp version?

As for why this doesn't exist: don't know, but good luck getting R-core to make additions like this ... I can't offhand think of a sufficiently widespread *misc package into which this could go ...

Follow up thread by Matthew on r-devel is here (which seems to confirm) :
r-devel: There is pmin and pmax each taking na.rm, how about psum?

Matt Dowle
  • 58,872
  • 22
  • 166
  • 224
Ben Bolker
  • 211,554
  • 25
  • 370
  • 453
  • Those don't return the required result. They remove the `NA`. The desired result is a length 4 vector : `3 3 4 6`. – Matt Dowle Oct 29 '12 at 14:41
  • Not `pdiff` because `psum(x,-y,na.rm=TRUE)` would do that. Perhaps `ptimes` too, but not `pdiv` as well for the same reason. But I guess that `psum` is a more common case than `ptimes`. – Matt Dowle Oct 29 '12 at 14:55
  • Yes that's one implementation of `psum`, but it would be very inefficient. I wasn't asking how might `psum` be implemented, but, what I asked really. – Matt Dowle Oct 29 '12 at 15:10
  • @MatthewDowle I'm not sure we *can* answer what you really asked - There aren't too many R Core people on [so] - one or two yes but wouldn't you be better asking that question on R-Devel? – Gavin Simpson Oct 29 '12 at 15:16
  • @GavinSimpson Ok will do. I know I'm much better off asking here first _then_ asking on r-devel if it survives you guys. – Matt Dowle Oct 29 '12 at 15:20
  • I don't think R-Core *will* answer. And I don't really think there is an answer, other than "no-one thought of it at the time, or when they did think of it they thought it wasn't necessary" -- and now they won't want to add it to base R because adding *anything* is considered a burden. Much discussion at http://stackoverflow.com/questions/8065835/proposing-feature-requests-to-the-r-core-team/ – Ben Bolker Oct 29 '12 at 15:22
  • 2
    Just checked performances of `pmin`,`pmax` and `psum` from this answer. 3 vectors of length `1e5`, `1e3` replications and results were pretty similar: 6.24, 6.21, 7.55 seconds respectively. While few other versions of `psum` that I found got 30 and more. – Julius Vainora Oct 29 '12 at 15:27
  • 1
    @Julius, feel free to edit my answer if you like (or post your benchmarks as an answer) – Ben Bolker Oct 29 '12 at 16:43
  • @Julius `n=1e5`? `7.26/1000==0.007`. i.e. try `benchmark` with `replications=3` and scale up the data size, instead. – Matt Dowle Oct 29 '12 at 17:37
  • 2
    This `psum` function is a good idea. Just one addition: Both `psum` and `pmax` return `NA` if all elements (at the same position in the vectors) are `NA` (even if `na.rm = TRUE`). Check `pmin(NA, NA, na.rm = TRUE)`. I slightly modified your function to achieve the same behavior for `psum`: `psum <- function(...,na.rm=FALSE) { dat <- do.call(cbind,list(...)); "[<-"(rowSums(dat, na.rm=na.rm), rowSums(is.na(dat)) == ncol(dat), NA) }` – Sven Hohenstein Oct 29 '12 at 17:51
  • @SvenHohenstein: good point. My only concerns are that (1) of course this will slow things down further (if that is an issue) and (2) I would probably split your second statement into several separate lines: it's a little *too* clever/compact for me ... – Ben Bolker Oct 29 '12 at 20:38
  • (1) I agree that the drop in speed is the major drawback of this approach. (2) Here's a modified version: `psum <- function(...,na.rm=FALSE) { dat <- do.call(cbind,list(...)); res <- rowSums(dat, na.rm=na.rm); idx_na <- !rowSums(!is.na(dat)); res[idx_na] <- NA; return(res) }` – Sven Hohenstein Oct 30 '12 at 06:30
  • It amuses me that the discussion on r-devel followed almost the same lines as the one here. I'm mildly disappointed but not at all surprised that R-core doesn't seem to have nibbled. – Ben Bolker Nov 01 '12 at 11:30
  • @BenBolker I'm not surprised either. A flat "no" would have been worse so I'm holding out hope and following step 1 of your good advice [here](http://stackoverflow.com/a/8066062/403310) i.e. waiting a week. The C implementation is very easy - just a matter of tweaking the existing `pmin` and `pmax` C source. – Matt Dowle Nov 01 '12 at 12:00
9

After a quick search on CRAN, there are at least 3 packages that have a psum function. rccmisc, incadata and kit. kit seems to be the fastest. Below reproducing the example of Ben Bolker.

benchmark(
  rccmisc::psum(x,y,z,na.rm=TRUE),
  incadata::psum(x,y,z,na.rm=TRUE),
  kit::psum(x,y,z,na.rm=TRUE), 
  psum(x,y,z,na.rm=TRUE),
  psum2(x,y,z,na.rm=TRUE),
  replications=20
)
#                                    test replications elapsed relative
# 2 incadata::psum(x, y, z, na.rm = TRUE)           20   20.05   14.220
# 3      kit::psum(x, y, z, na.rm = TRUE)           20    1.41    1.000
# 4           psum(x, y, z, na.rm = TRUE)           20    8.04    5.702
# 5          psum2(x, y, z, na.rm = TRUE)           20   20.44   14.496
# 1  rccmisc::psum(x, y, z, na.rm = TRUE)           20   23.24   16.482
Suresh_Patel
  • 291
  • 3
  • 5
1

Another approach whose advantage is to also work with matrices, just like pmin and pmax.

psum <- function(..., na.rm = FALSE) {
  plus_na_rm <- function(x, y) ifelse(is.na(x), 0, x) + ifelse(is.na(y), 0, y)
  Reduce(if(na.rm) plus_na_rm else `+`, list(...))
}

x = c(1,3,NA,5)
y = c(2,NA,4,1)

psum(x, y)
#> [1]  3 NA NA  6
psum(x, y, na.rm = TRUE)
#> [1] 3 3 4 6

# With matrices
A <- matrix(1:9, nrow = 3)
B <- matrix(c(NA, 2:8, NA), nrow = 3)

psum(A, B)
#>      [,1] [,2] [,3]
#> [1,]   NA    8   14
#> [2,]    4   10   16
#> [3,]    6   12   NA
psum(A, B, na.rm = TRUE)
#>      [,1] [,2] [,3]
#> [1,]    1    8   14
#> [2,]    4   10   16
#> [3,]    6   12    9

Created on 2020-03-09 by the reprex package (v0.3.0)

One caveat: if an element is NA across all the summed objects and na.rm = TRUE, the result will be 0 (and not NA).

For example:

psum(NA, NA, na.rm = TRUE)
#> [1] 0
asachet
  • 6,620
  • 2
  • 30
  • 74