Is it posible to optimize (vectorize) these two functions for better performance

Question

In my first attempts in using R I wrote two functions that are not very performant I guess and would appreciate if I can receive some hints on how to make them more performant (vectorized). Both functions come with "test case" at the end.

The first function takes two time series xts objects x and y and returns a series which contains data on how many days x is higher/lower than y.

require('xts')
require('quantmod')

countDaysBelowOrAbove <- function(x, y) {
    x <- try.xts(x, error=as.matrix)
    y <- try.xts(y, error=as.matrix)

    if(is.xts(x) && is.xts(y)) {
        xy <- cbind(x,y)
    } else {
        xy <- cbind( as.vector(x), as.vector(y) )
    }

    # Count NAs, ensure they're only at beginning of data, then remove.
    xNAs <- sum( is.na(x) )
    yNAs <- sum( is.na(y) )
    NAs <- max( xNAs, yNAs )
    if( NAs > 0 ) {
        if( any( is.na(xy[-(1:NAs),]) ) ) stop("Series contain non-leading NAs")
    }

    resultDaysLower <- x
    resultDaysHigher <- x
    resultDaysLower[!is.na(resultDaysLower)]<-0
    resultDaysHigher[!is.na(resultDaysHigher)]<-0

    series<-cbind(xy, resultDaysLower, resultDaysHigher)
    colnames(series) <- c(names(xy), "cumDaysLower", "cumDaysHigher")

    daysLower = 0
    daysHigher = 0

    for (i in 1:NROW(xy)) {
        if (!(is.na(series[,1][i]) | is.na(series[,2][i]))) {
            if (series[,1][i] >= series[,2][i]) {
                daysLower = 0
                daysHigher = daysHigher + 1
            }
            else {
                daysHigher = 0
                daysLower = daysLower + 1
            }
        }
        else {
            daysLower = 0
            daysHigher = 0
        }
        series$cumDaysLower[i] = daysLower
        series$cumDaysHigher[i] = daysHigher                
    }
    return(series)
}

getSymbols("SPY", from='2005-01-01')
SPYclose = Cl(SPY)

getSymbols("QQQQ", from='2005-01-01')
QQQQclose = Cl(QQQQ)

testData = countDaysBelowOrAbove(SPYclose, QQQQclose)

The second function I would appreciate help with performance optimization is below. The function takes as parameter an xts object series and an xts object representing lengths of interval to calculate minimum of series at a specified time. The function returns calculated minimum of series with specified window for minimum calculation set in lengths.

minimumWithVaryingLength<-function(series, lengths) {
    series <- try.xts(series, error=as.matrix)
    lengths <- try.xts(lengths, error=as.matrix)

    if(is.xts(series) && is.xts(lengths)) {
        serieslengths <- cbind(series,lengths)
    } else {
        serieslengths <- cbind( as.vector(series), as.vector(lengths) )
    }

    # Count NAs, ensure they're only at beginning of data, then remove.
    seriesNAs <- sum( is.na(series) )
    lengthsNAs <- sum( is.na(lengths) )
    NAs <- max( seriesNAs, lengthsNAs )
    if( NAs > 0 ) {
        if( any( is.na(serieslengths[-(1:NAs),]) ) ) stop("Series contain non-leading NAs")
    }

    result <- series
    result[!is.na(result)]<-0

    for (i in 1:NROW(serieslengths)) {  
        if (lengths[i] > 0) {
            result[i] <- runMin(series, n=lengths[i], cumulative=FALSE)[i]
        }
        else {
            result[i] <- 0
        }
    }

    return(result)
}

getSymbols("SPY", from='2005-01-01')
SPYclose = Cl(SPY)

getSymbols("QQQQ", from='2005-01-01')
QQQQclose = Cl(QQQQ)

numDaysBelow = countDaysBelowOrAbove(SPYclose, QQQQclose)
test = minimumWithVaryingLength(SPYclose, numDaysBelow)

Thanks in advance for your kind help.

Kind regards, Samo.

score 4 · Accepted Answer · edited May 23 '17 at 10:33

For the first function you're looking for the cumulative number of periods during which series x is lower/higher than y. For that you can use this handy function CumCount() built from cummax. First some sample data:

set.seed(1)
x <- sample(1:5,20,T)
y <- sample(1:5,20,T)

CumCount <- function(x) {
  z <- cumsum(x)
  z - cummax(z*(!x))
}

CumLow = CumCount(x<y)
CumHigh = CumCount(x>y)

For your second computation, you're trying to find the cumulative minimum x value within each period during which x < y. For this the rle function is very useful ("run-length-encoding").

# runs equals the length of each phase (x < y or x > y)
runs <- rle(CumLow > 0)$lengths
# starts is the number of periods prior to each phase...
starts <- c(0,cumsum(runs)[-length(runs)]) 
#... which we use to build "blocks", a list of indices of each phase.
blocks <- mapply( function(x,y) x+y, starts, lapply(runs,seq))
# now apply the cummin function within each block:
# (remember to mask it by CumLow > 0 -- 
#   we only want to do this within the x<y phase)
BlockCumMin <- unlist(sapply(blocks, function(blk) cummin(x[blk]))) * (CumLow > 0)

Now we put it all together:

  > cbind(x,y, CumLow, CumHigh, BlockCumMin)

      x y CumLow CumHigh BlockCumMin
 [1,] 3 4      1       0           3
 [2,] 4 2      0       1           0
 [3,] 2 2      0       0           0
 [4,] 2 5      1       0           2
 [5,] 4 4      0       0           0
 [6,] 2 2      0       0           0
 [7,] 4 1      0       1           0
 [8,] 1 3      1       0           1
 [9,] 2 5      2       0           1
[10,] 1 3      3       0           1
[11,] 2 5      4       0           1
[12,] 1 4      5       0           1
[13,] 4 2      0       1           0
[14,] 5 3      0       2           0
[15,] 4 1      0       3           0
[16,] 4 1      0       4           0
[17,] 3 4      1       0           3
[18,] 3 1      0       1           0
[19,] 5 3      0       2           0
[20,] 4 4      0       0           0

Note that this problem is related to this question

Update. For the more general case where you have a series vector, a lengths vector (of same length as series), and you want to produce a result called BlockMins where BlockMins[i] is the minimum of the lengths[i] block of series ending at position i, you could do the following. Since the lengths are arbitrary, this is no longer a cumulative min; for each i you have to take the min of the length[i] elements of series ending at position i:

set.seed(1)
series <- sample(1:5,20,T)
lengths <- sample(3:5,20,T)
BlockMins <- sapply(seq_along(lengths), 
                    function(i) min( series[ i : max(1, (i - lengths[i]+1)) ]) )
> cbind(series, lengths, BlockMins)
      series lengths BlockMins
 [1,]      1       5         1
 [2,]      1       4         1
 [3,]      3       3         1
 [4,]      4       4         1
 [5,]      5       3         3
 [6,]      1       4         1
 [7,]      1       5         1
 [8,]      4       3         1
 [9,]      2       5         1
[10,]      2       4         1
[11,]      1       5         1
[12,]      2       5         1
[13,]      2       3         1
[14,]      2       4         1
[15,]      4       5         1
[16,]      3       5         2
[17,]      5       3         3
[18,]      1       4         1
[19,]      5       3         1
[20,]      3       3         1

Prasad, thank you very much. This is really insightful. Is cummin function really the same as runMin in TTR package? — Samo, Mar 02 '11 at 17:34
You're welcome. The `runMin` function in `TTR` has more functionality than `cummin` -- it allows you to do the `min` over a moving window. In fact it calls `cummin` if the `cumulative` optional arg = `TRUE`. You can see how it's implemented by typing `runMin` at the R console -- remember R is open-source! — Prasad Chalasani, Mar 02 '11 at 18:01
So I can simply exchange cummin with runMin from TTR in your code and everyting will work as in my nonperformant example above except it will be much faster? Beacuse basically I ned two separate functions as stated in my question with different moving window. Thnx. — Samo, Mar 02 '11 at 18:16
Not sure what you mean -- why would you want to use runMin instead of cummin in my code? In my code I was showing how you can do the two types of running calculations: one is a cumulative count of number of days x is below y in the current "phase"; the other is a cumulative minimum in each "phase", where by "phase" I mean a contiguous period (or block) of time during which x is less than y. — Prasad Chalasani, Mar 02 '11 at 18:55
`cummin` is part of base R, whereas TTR is an add-on package. So when you use `cummin` you are not using a different package, so I would suggest using the `cummin` as it is. Moreover, it is not straightforward to replace the `cummin` in the above code with `runMin`. For example, if you do `runMin(1:5, n=1, cumulative=TRUE)`, you get `c(NA,1,1,1,1)`, and you have to replace the NA with the first element of the input vector. This all sounds much more tedious than just using `cummin`. — Prasad Chalasani, Mar 02 '11 at 19:16
Basically I need second function, that calculates minimums with varying lengths (based on a vector of lengths supplied as a parameter to the function) separate from the first one. Because they are used differently, not always just in the context of one time series being below another one... So, is it possible to replace a for loop for (i in 1:NROW(serieslengths)) { if (lengths[i] > 0) { result[i] <- runMin(series, n=lengths[i], cumulative=FALSE)[i] } else { result[i] <- 0 } } with some vetorisation with applying runMin... — Samo, Mar 02 '11 at 19:38
Ok I see what you want -- if the lengths are arbitrary, it's no longer a cumulative min at all: you simply do a min over each block. In your code you're doing a `runMin` over the *entire* series, then extracting just the `i`th element -- this is wasteful. Pls see my added code to do this. — Prasad Chalasani, Mar 02 '11 at 20:14
Thnx. Exactley. At every timestamp i for series I wat to look back lengths[i] periods and calculate minimum of the series for this window. Thanx again for your help. — Samo, Mar 02 '11 at 20:36

score 1 · Answer 2 · answered Mar 02 '11 at 01:56

1

Without dealing with the time series apparatus, if you have two vectors x and y and want to "return a series which contains data on how many days x is higher/lower than y," simply compare them:

# Make up some data
x <- seq(100)
y <- x[sample(x)]
# Compare
x.greater <- sum(x>y)
x.lesser <- sum(x<y)

The key to this is that when you sum a logical vector e.g. (x>y), R coerces TRUEs to 1 and FALSEs to 0.

answered Mar 02 '11 at 01:56

Ari B. Friedman

71,271
35
175
235

No worries. We've all been there (myself more than others!), and sometimes writing the complicated function helps you figure out exactly what you need out of it. It's all part of the process (and the power of vectorization!). – Ari B. Friedman Mar 02 '11 at 02:19
1

This isn't quite what Samo was looking for -- I believe it's more complex. Pls see my answer. – Prasad Chalasani Mar 02 '11 at 04:06
@Prasad: Your answer definitely fits the problem better. – Ari B. Friedman Mar 03 '11 at 00:36

Is it posible to optimize (vectorize) these two functions for better performance

2 Answers2