
I have a number of columns in an xts object, and I want to find the percentage of rows in which the first column is above a certain number, the percentage in which either the first or second column is above it, the percentage in which any of the first three columns is above it, and so on.

I'm currently doing it manually, as follows:

library(xts)
set.seed(69) 
x = xts(cbind(v.1 = runif(20)*100, v.2 = runif(20)*100, v.3 = runif(20)*100, v.4 = runif(20)*100), Sys.Date() - 20:1)

c(
  mean( x$v.1 > 50),
  mean( x$v.1 > 50 | x$v.2 > 50) ,
  mean( x$v.1 > 50 | x$v.2 > 50 | x$v.3 > 50) ,
  mean( x$v.1 > 50 | x$v.2 > 50 | x$v.3 > 50 | x$v.4 > 50)
  )

Which gives this example output:

[1] 0.50 0.70 0.80 0.95

But now I want to generalize to any number of columns, not just v.1 to v.4, so I'm looking for a single function, something like this:

this_is_mean( x, c('v.1','v.2','v.3','v.4'), 50)

or maybe it would look like:

mean ( foo( x, c('v.1','v.2','v.3','v.4'), 50) )

(I'll be using paste('v',1:N,sep='.') for the column names, of course)

Darren Cook

3 Answers


It seems like you should be able to use sapply and rowSums (if I understand your problem correctly):

sapply(1:ncol(x), function(y) mean(rowSums(x[, seq(y)] > 50) >= 1))
## [1] 0.50 0.70 0.80 0.95

Use vapply(1:ncol(x), function(y) mean(rowSums(x[, seq(y)] > 50) >= 1), numeric(1L)) to eke out a little bit more speed, if required.
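
If you want this packaged in the single-function form the question asks for, here is a minimal sketch along those lines; the name prop_any_above and its signature are my own invention, not from any package:

prop_any_above <- function(x, cols, threshold) {
  x <- x[, cols]                       # keep only the requested columns, in order
  vapply(seq_along(cols),
         function(k) mean(rowSums(x[, seq(k), drop = FALSE] > threshold) >= 1),
         numeric(1L))
}

prop_any_above(x, paste("v", 1:4, sep = "."), 50)
## should match the manual result above: 0.50 0.70 0.80 0.95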

A5C1D2H2I1M1N2O1R2T1

Here's another alternative that seems to be a bit faster than @AnandaMahto's solution (on this example). You might also find it a bit more straightforward.

R> rowMeans(apply(x > 50, 1, cumsum) >= 1)
 v.1  v.2  v.3  v.4 
0.50 0.70 0.80 0.95

Though do note that rowMeans only makes one pass over the data, unlike mean, which makes two passes (the second pass is for floating-point arithmetic error correction).
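
If it helps to see what that expression is doing, it can be split in two (the name tmp is only for illustration):

tmp <- apply(x > 50, 1, cumsum)  # k x n matrix: tmp[j, i] counts how many of the
                                 # first j columns exceed 50 on date i
rowMeans(tmp >= 1)               # share of dates where at least one of the first
                                 # 1, 2, ..., k columns exceeds 50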

Joshua Ulrich
  • Off-topic: do you have a link explaining why mean does two passes? All I found so far was a link to the code itself: https://github.com/wch/r-source/blob/trunk/src/main/summary.c#L430 I couldn't quite grasp if this was for when your vector is full of large numbers, or has numbers of very different magnitude, or just for when the vector is very large... anyway, this was for the final stage of a report, so I'm neither CPU bound, nor needing more than 2d.p. accuracy :-) – Darren Cook Jul 02 '15 at 11:36
  • If I only want this to operate on v.1 .. v.9 in a 50-column xts object, how do I specify the columns it should use? Or do I just subset `x` first? – Darren Cook Jul 02 '15 at 11:38
  • @DarrenCook: either subset `x` first, or subset the result. Regarding why mean does two passes, see [What algorithm is R using to calculate mean?](http://stackoverflow.com/q/17866149/271616). I'm not sure it will matter in this case, but wanted to point it out because the results from `mean` and `rowMeans` may not be `identical`, though they should be `all.equal`. – Joshua Ulrich Jul 02 '15 at 11:40
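
For example, with a hypothetical 50-column object xbig holding columns v.1 to v.50, the two options mentioned in the comment above would look like:

rowMeans(apply(xbig[, paste("v", 1:9, sep = ".")] > 50, 1, cumsum) >= 1)  # subset x first
rowMeans(apply(xbig > 50, 1, cumsum) >= 1)[1:9]                           # or subset the result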

We can also do it by matrix multiplication:

colSums(((x>50) %*% !lower.tri(diag(ncol(x))))>0) / nrow(x)

Multiplication by this upper-triangular matrix of ones selects the first column, the first two columns, and so on. I compared it to

rowMeans(apply(x > 50, 1, cumsum) >= 1)

and it seems to be even faster, though the expression is uglier.
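
For reference, here is what the selector matrix looks like in the four-column example (sel is just a throwaway name): column j has TRUE in its first j rows, so column j of the product counts exceedances among the first j columns of x.

sel <- !lower.tri(diag(ncol(x)))
sel
##       [,1]  [,2]  [,3] [,4]
## [1,]  TRUE  TRUE  TRUE TRUE
## [2,] FALSE  TRUE  TRUE TRUE
## [3,] FALSE FALSE  TRUE TRUE
## [4,] FALSE FALSE FALSE TRUE
colSums(((x > 50) %*% sel) > 0) / nrow(x)  # same values as above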

mra68