3

I have a custom rank function that I stole from here (with some modifications) :) Rcpp sugar for rank function

The problem is that it does min ties and I need average ties

Here is what I have

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector sort_rcpp(NumericVector x) {
  std::vector<double> tmp = Rcpp::as< std::vector<double> > (x);
  std::sort(tmp.begin(), tmp.end());
  return wrap(tmp);
}

// [[Rcpp::export]]
IntegerVector rank_(NumericVector x) {
  return match(x, sort_rcpp(x));
}


/*** R
x <- c(1:5, 1:5)

rank(x, ties = 'average')
rank(x, ties = 'min')
rank_(x)
*/

After saving that to a file then running this gets the results

Rcpp::sourceCpp('~/Documents/Rank.cpp')

Which returns

# x <- c(1:5, 1:5)
#
# # what I need
# rank(x, ties = 'average')
# [1] 1.5 3.5 5.5 7.5 9.5 1.5 3.5 5.5 7.5 9.5
#
# # What I am getting
# rank(x, ties = 'min')
# [1] 1 3 5 7 9 1 3 5 7 9
#
# rank_(x)
# [1] 1 3 5 7 9 1 3 5 7 9

What do I need to modify in the c++ code to match the average rank function from base R?

Community
  • 1
  • 1
JackStat
  • 1,593
  • 1
  • 11
  • 17
  • I am happy to modify the question if I made a mistake or if something is unclear but I am at a loss for understanding the downvotes :( – JackStat Aug 25 '16 at 19:33
  • 1
    not sure... maybe for lack of attribution of the "stolen" code? On a side note, rank in R is going to be written in C I'd wager, and will have many sanity checks that you might ignore or fail to forsee. – shayaa Aug 25 '16 at 19:51
  • You may have to write one. A bit of googling lead me to [this page](https://sites.google.com/site/jivsoft/Home/compute-ranks-of-elements-in-a-c---array-or-vector) which has code for rank including the average case. – Dirk Eddelbuettel Aug 25 '16 at 21:59
  • 1
    @DirkEddelbuettel There is code [here too](http://stackoverflow.com/questions/30822729/create-ranking-for-vector-of-double/30827731#30827731). – shayaa Aug 25 '16 at 22:04
  • @shayaa Nice, looks pretty straightforward. – Dirk Eddelbuettel Aug 25 '16 at 22:06
  • 1
    @shayaa that is perfect! Thank you – JackStat Aug 25 '16 at 23:00
  • For people reading this in 2023 - the data.table package has a function called `frank` that does average ranks and is pretty fast (faster on large vectors than the accepted solution below). I know that the question is directly asking for an Rcpp solution, but just thought I offered this solution as well. – Skumin May 13 '23 at 17:51

2 Answers2

8

This is an adapted version of René Richter's code in shayaa's link -- the main differences being the use of Rcpp::seq instead of std::iota and a custom comparator that handles NA comparisons:

#include <Rcpp.h>

class Comparator {
private:
    const Rcpp::NumericVector& ref;

    bool is_na(double x) const 
    {
        return Rcpp::traits::is_na<REALSXP>(x);    
    }

public:
    Comparator(const Rcpp::NumericVector& ref_)
        : ref(ref_)
    {}

    bool operator()(const int ilhs, const int irhs) const
    {
        double lhs = ref[ilhs], rhs = ref[irhs]; 
        if (is_na(lhs)) return false;
        if (is_na(rhs)) return true;
        return lhs < rhs;
    }
};

// [[Rcpp::export]]
Rcpp::NumericVector avg_rank(Rcpp::NumericVector x)
{
    R_xlen_t sz = x.size();
    Rcpp::IntegerVector w = Rcpp::seq(0, sz - 1);
    std::sort(w.begin(), w.end(), Comparator(x));

    Rcpp::NumericVector r = Rcpp::no_init_vector(sz);
    for (R_xlen_t n, i = 0; i < sz; i += n) {
        n = 1;
        while (i + n < sz && x[w[i]] == x[w[i + n]]) ++n;
        for (R_xlen_t k = 0; k < n; k++) {
            r[w[i + k]] = i + (n + 1) / 2.;
        }
    }

    return r;
}

Verifying the results against base::rank,

x <- c(1:7, 1:2, 1:5, 1:10)
all.equal(
    rank(x, ties.method = "average"), 
    avg_rank(x)
)
# [1] TRUE 

Also note that this handles NAs correctly while your version does not:

all.equal(
    rank(c(NA, x), ties.method = "average"), 
    avg_rank(c(NA, x))
)
# [1] TRUE

all.equal(
    rank(c(NA, x), ties.method = "average"), 
    rank_(c(NA, x))
)
# Error: can't subset using a logical vector with NAs

Here is a benchmark with the vector x from above:

microbenchmark::microbenchmark(
    ".Internal" = .Internal(rank(x, length(x), ties = "average")),
    avg_rank(x),
    "base::rank" = rank(x, ties.method = "average"),
    rank_(x),
    times = 1000L
)
# Unit: microseconds
#         expr    min     lq      mean median     uq     max neval
#    .Internal  1.283  1.711  2.029777  1.712  2.139  65.002  1000
#  avg_rank(x)  2.566  3.422  4.057262  3.849  4.277  23.521  1000
#   base::rank 13.685 16.251 18.145440 17.534 18.390 163.360  1000
#     rank_(x) 25.659 28.653 31.203092 29.936 32.074 112.898  1000

And here is a benchmark with a 1e6-length vector (I didn't include rank_ because even a single evaluation was taking forever; see below):

set.seed(123)
xx <- sample(x, 1e6, TRUE)

microbenchmark::microbenchmark(
    ".Internal" = .Internal(rank(xx, length(xx), ties = "average")),
    avg_rank(xx),
    "base::rank" = rank(xx, ties.method = "average"),
    times = 100L
)
# Unit: milliseconds
#          expr      min       lq     mean   median       uq      max neval
#     .Internal 302.2488 309.7977 334.7977 322.0396 347.4779 504.1041   100
#  avg_rank(xx) 304.4435 309.9840 330.4902 316.7807 331.6825 427.7171   100
#    base::rank 312.1196 327.3319 351.6237 343.1317 366.7316 445.9004   100

For larger sized vectors, the runtime of these three functions is much closer. In theory, the .Internal call should always be a little faster than base::rank since it foregoes additional checks that take place in the body of the latter. However, the difference is less pronounced in the second benchmark as the amount of time needed to do those checks represents a much smaller proportion of the function's total runtime.


As a side note, one of the obvious reasons your code is so inefficient is because you are creating vectors in the body of your loop:

for (int i = 0; i < n; ++i) {
    NumericVector xVal = x[x == x[i]];              // here
    IntegerVector Match = match(xVal, sortX);       // here
    double minM = min(Match);
    int matchSize = Match.size();
    NumericVector aves = NumericVector(matchSize);  // here

    for (int k = 0; k < matchSize; ++k) {
        aves[k] = minM + (k);
    }
    ranks[i] = sum(aves)/Match.size();
}

Both avg_rank and R's implementation (I believe, you can double check the source code) are only creating two additional vectors, regardless of the size of the input. Your function is creating 2 + 3 * N vectors (!!!), where N is the size of your input.

And finally, not really related to performance, but instead of sorting like this (which will not handle NAs correctly),

NumericVector sort_rcpp(NumericVector x) {
    std::vector<double> tmp = Rcpp::as< std::vector<double> > (x);
    std::sort(tmp.begin(), tmp.end());
    return wrap(tmp);
} 

just use the tools Rcpp provides:

NumericVector RcppSort(NumericVector x) {
    return Rcpp::clone(x).sort();
}
Community
  • 1
  • 1
nrussell
  • 18,382
  • 4
  • 47
  • 60
  • Wow! That is great. I will be putting this in my R package. How would you like credit? – JackStat Aug 26 '16 at 17:41
  • 1
    You can just cite the link to this question in a comment at the top of the function's source file. The credit should really be to René Richter; I just made a couple of trivial adjustments to their algorithm. – nrussell Aug 26 '16 at 17:46
0

Ok I worked out the code in R and then translated it to Rcpp. I was hoping that it would be as fast as the rank_ function that I had in the question (now minrank) but it is quite slow compared to the version in base R

#include <Rcpp.h>
using namespace Rcpp;


// [[Rcpp::export]]
NumericVector sort_rcpp(NumericVector x) {
  std::vector<double> tmp = Rcpp::as< std::vector<double> > (x);
  std::sort(tmp.begin(), tmp.end());
  return wrap(tmp);
}

// [[Rcpp::export]]
IntegerVector minrank(NumericVector x) {
  return match(x, sort_rcpp(x));
}

// [[Rcpp::export]]
NumericVector rank_(NumericVector x) {

  NumericVector sortX = sort_rcpp(x);

  int n = x.size();

  NumericVector ranks = NumericVector(n);

  for(int i = 0; i < n; ++i) {
    NumericVector xVal = x[x == x[i]];
    IntegerVector Match = match(xVal, sortX);
    double minM = min(Match);
    int matchSize = Match.size();
    NumericVector aves = NumericVector(matchSize);

    for(int k = 0; k < matchSize; ++k){
      aves[k] = minM + (k);
    }
    ranks[i] = sum(aves)/Match.size();
  }

  return ranks;

}


/*** R

x <- c(1:7, 1:2, 1:5, 1:10)
r1 <- rank(x, ties = 'average')
r2 <- rank_(x)
all(r1 == r2)

library(microbenchmark)
microbenchmark(
  rank(x, ties = 'average')
  ,rank_(x)
  ,rank(x, ties = 'min')
  ,minrank(x)
  ,.Internal(rank(x, length(x), ties = 'average'))
)

*/


#> x <- c(1:7, 1:2, 1:5, 1:10)
#
#> r1 <- rank(x, ties = 'average')
#
#> r2 <- rank_(x)
#
#> all(r1 == r2)
#[1] TRUE
#
#> library(microbenchmark)
#
#> microbenchmark(
#+   rank(x, ties = 'average')
#+   ,rank_(x)
#+   ,rank(x, ties = 'min')
#+   ,minrank(x)
#+   ,.Internal(rank(x, length(x), ties = 'ave .... [TRUNCATED] 
#Unit: microseconds
#                                            expr    min      lq     mean  median     uq    max neval
#                       rank(x, ties = "average") 13.233 14.6510 17.69987 15.3795 16.432 82.706   100
#                                        rank_(x) 23.035 25.4525 26.98596 26.3180 27.520 42.938   100
#                           rank(x, ties = "min") 13.244 14.3300 17.30062 15.1200 16.763 60.819   100
#                                      minrank(x)  2.529  3.0415  3.66911  3.4265  3.706 14.190   100
# .Internal(rank(x, length(x), ties = "average"))  1.236  1.4185  1.59032  1.4855  1.599  3.125   100
#

I wonder why the base::rank is so much slower than calling .Internal rank. minrank in Rcpp is much faster than base::rank.

My cpp code is probably horribly inefficient but it works :|

JackStat
  • 1,593
  • 1
  • 11
  • 17