2

The excellent Metrics package provides a function to calculate average precision: apk.

The problem is, it's based on a for loop, and it's slow:

require('Metrics')
require('rbenchmark')
actual <- 1:20000
predicted <- c(1:20, 200:600, 900:1522, 14000:32955)
benchmark(replications=10,
          apk(5000, actual, predicted),
          columns= c("test", "replications", "elapsed", "relative"))

                          test replications elapsed relative
1 apk(5000, actual, predicted)           10   53.68        1

I'm at a loss as to how to vectorize this function, but I was wondering if perhaps there's a better way to implement this in R.

Zach
  • 29,791
  • 35
  • 142
  • 201

2 Answers2

5

I'd have to agree the implementation looked pretty bad... Try this instead:

apk2 <- function (k, actual, predicted)  {

    predicted <- head(predicted, k)

    is.new <- rep(FALSE, length(predicted))
    is.new[match(unique(predicted), predicted)] <- TRUE

    is.relevant <- predicted %in% actual & is.new

    score <- sum(cumsum(is.relevant) * is.relevant / seq_along(predicted)) /
             min(length(actual), k)
    score
}

benchmark(replications=10,
          apk(5000, actual, predicted),
          apk2(5000, actual, predicted),
          columns= c("test", "replications", "elapsed", "relative"))

#                            test replications elapsed relative
# 1  apk(5000, actual, predicted)           10  62.194 2961.619
# 2 apk2(5000, actual, predicted)           10   0.021    1.000

identical(apk(5000, actual, predicted),
          apk2(5000, actual, predicted))
# [1] TRUE
flodel
  • 87,577
  • 21
  • 185
  • 223
  • Is `is.new` just `!duplicated(predicted)`? I'd also expect that you'd line up the prediction and actual results: `predicted == actual` rather than `predicted %in% actual`. Oh, but I guess you're comparing the _sequence_, not the individual values. – hadley Dec 07 '12 at 14:29
  • Yes, `!duplicated(predicted)` is a great improvement. For `predicted == actual`, I don't think so. The two vectors could have different lengths from the start or via `k`. – flodel Dec 07 '12 at 14:34
0
I happen to have average precision code written using for loop. I think it is fast enough.

ap <- function(prediction) {
    #prediction is a two column matrix. The first one is the true label and the second one is the prediction value
    result = 0
    ranklist <- prediction[sort(prediction[,2],decreasing=TRUE, index.return=TRUE)$ix,]
    numpos <- length(which(ranklist[,1]==1))
    deltaRecall <- 1/numpos
    pcount <- 0

    for(i in 1:nrow(ranklist)) {
        if(ranklist[i,1] == 1) {
            pcount <- pcount + 1
            precision <- pcount/i
            result <- result + precision*deltaRecall
        }
    }
    return(result)
}

ap_at_N <- function(prediction, N=20) {
    #average precision at N
    result = 0
    ranklist <- prediction[sort(prediction[,2],decreasing=TRUE, index.return=TRUE)$ix,]
    numpos <- length(which(ranklist[,1]==1))
    numpos <- min(N, numpos)
    deltaRecall <- 1/numpos
    pcount <- 0

    for(i in 1:(min(nrow(ranklist),N))) {
        if(ranklist[i,1] == 1) {
            pcount <- pcount + 1
            precision <- pcount/i
            result <- result + precision*deltaRecall
        }
    }
    return(result)
}
user1165814
  • 405
  • 4
  • 5
  • I'd love to see some benchmarking against @flodel's function, which I suspect is much faster for longer sequences. – Zach Feb 04 '14 at 15:53
  • If you would like to accelerate it which is unnecessary in my problem, you can rewrite the for loop using lapply function. – user1165814 Feb 04 '14 at 21:12
  • That won't change it much. I'm also curious-- my example function and the other answer both take a predictor AND a response. How does yours work with just a predictor? – Zach Feb 04 '14 at 21:40
  • #prediction is a two column matrix. The first one is the true label and the second one is the prediction value. prediction = cbind(actual, prediction) – user1165814 Feb 04 '14 at 21:46