
I have a number of columns in an xts object, and I want to find the percentage of rows in which the first column is above a certain number, the percentage in which either the first or second column is above it, the percentage in which any of the first three columns is above it, and so on.

I'm currently doing it manually, as follows:

library(xts)
set.seed(69) 
x = xts(cbind(v.1 = runif(20)*100, v.2 = runif(20)*100, v.3 = runif(20)*100, v.4 = runif(20)*100), Sys.Date() - 20:1)

c(
  mean( x$v.1 > 50),
  mean( x$v.1 > 50 | x$v.2 > 50) ,
  mean( x$v.1 > 50 | x$v.2 > 50 | x$v.3 > 50) ,
  mean( x$v.1 > 50 | x$v.2 > 50 | x$v.3 > 50 | x$v.4 > 50)
  )

Which gives this example output:

[1] 0.50 0.70 0.80 0.95

But now I want to generalize to any number of columns, not just v.1 to v.4, so I'm looking for a single function, something like this:

this_is_mean( x, c('v.1','v.2','v.3','v.4'), 50)

or maybe it would look like:

mean ( foo( x, c('v.1','v.2','v.3','v.4'), 50) )

(I'll be using paste('v',1:N,sep='.') for the column names, of course)

Darren Cook

3 Answers


It seems like you should be able to use sapply and rowSums (if I understand your problem correctly):

sapply(1:ncol(x), function(y) mean(rowSums(x[, seq(y)] > 50) >= 1))
## [1] 0.50 0.70 0.80 0.95

Use vapply(1:ncol(x), function(y) mean(rowSums(x[, seq(y)] > 50) >= 1), numeric(1L)) to eke out a little bit more speed, if required.
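
If you want this packaged in the single-function form the question asks for, here is a minimal sketch along those lines; the name prop_any_above and its signature are my own invention, not from any package:

prop_any_above <- function(x, cols, threshold) {
  x <- x[, cols]                       # keep only the requested columns, in order
  vapply(seq_along(cols),
         function(k) mean(rowSums(x[, seq(k), drop = FALSE] > threshold) >= 1),
         numeric(1L))
}

prop_any_above(x, paste("v", 1:4, sep = "."), 50)
## should match the manual result above: 0.50 0.70 0.80 0.95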

A5C1D2H2I1M1N2O1R2T1

Here's another alternative that seems to be a bit faster than @AnandaMahto's solution (on this example). You might also find it a bit more straightforward.

R> rowMeans(apply(x > 50, 1, cumsum) >= 1)
 v.1  v.2  v.3  v.4 
0.50 0.70 0.80 0.95

Though do note that rowMeans only makes one pass over the data, unlike mean, which makes two passes (the second pass is for floating-point arithmetic error correction).
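
If it helps to see what that expression is doing, it can be split in two (the name tmp is only for illustration):

tmp <- apply(x > 50, 1, cumsum)  # k x n matrix: tmp[j, i] counts how many of the
                                 # first j columns exceed 50 on date i
rowMeans(tmp >= 1)               # share of dates where at least one of the first
                                 # 1, 2, ..., k columns exceeds 50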

Joshua Ulrich
  • Off-topic: do you have a link explaining why mean does two passes? All I found so far was a link to the code itself: https://github.com/wch/r-source/blob/trunk/src/main/summary.c#L430 I couldn't quite grasp if this was for when your vector is full of large numbers, or has numbers of very different magnitude, or just for when the vector is very large... anyway, this was for the final stage of a report, so I'm neither CPU bound, nor needing more than 2d.p. accuracy :-) – Darren Cook Jul 02 '15 at 11:36
  • If I only want this to operate on v.1 .. v.9 in a 50-column xts object, how do I specify the columns it should use? Or do I just subset `x` first? – Darren Cook Jul 02 '15 at 11:38
  • @DarrenCook: either subset `x` first, or subset the result. Regarding why mean does two passes, see [What algorithm is R using to calculate mean?](http://stackoverflow.com/q/17866149/271616). I'm not sure it will matter in this case, but wanted to point it out because the results from `mean` and `rowMeans` may not be `identical`, though they should be `all.equal`. – Joshua Ulrich Jul 02 '15 at 11:40
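
For example, with a hypothetical 50-column object xbig holding columns v.1 to v.50, the two options mentioned in the comment above would look like:

rowMeans(apply(xbig[, paste("v", 1:9, sep = ".")] > 50, 1, cumsum) >= 1)  # subset x first
rowMeans(apply(xbig > 50, 1, cumsum) >= 1)[1:9]                           # or subset the result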

We can also do it by matrix multiplication:

colSums(((x>50) %*% !lower.tri(diag(ncol(x))))>0) / nrow(x)

Multiplication by this upper-triangular matrix of ones selects the first column, the first two columns, and so on. I compared it to

rowMeans(apply(x > 50, 1, cumsum) >= 1)

and it seems to be even faster, though the expression is uglier.
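
For reference, here is what the selector matrix looks like in the four-column example (sel is just a throwaway name): column j has TRUE in its first j rows, so column j of the product counts exceedances among the first j columns of x.

sel <- !lower.tri(diag(ncol(x)))
sel
##       [,1]  [,2]  [,3] [,4]
## [1,]  TRUE  TRUE  TRUE TRUE
## [2,] FALSE  TRUE  TRUE TRUE
## [3,] FALSE FALSE  TRUE TRUE
## [4,] FALSE FALSE FALSE TRUE
colSums(((x > 50) %*% sel) > 0) / nrow(x)  # same values as above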

mra68