3

I'm a newbie to C++ and Rcpp. Suppose, I have a vector

t1<-c(1,2,NA,NA,3,4,1,NA,5)

and I want to get a index of elements of t1 that are NA. I can write:

NumericVector retIdxNA(NumericVector x) {

    // Step 1: get the positions of NA in the vector
    LogicalVector y=is_na(x);

    // Step 2: count the number of NA
    int Cnt=0;
    for (int i=0;i<x.size();i++) {
       if (y[i]) {
         Cnt++;
       }
    }

    // Step 3: create an output matrix whose size is same as that of NA
    // and return the answer
    NumericVector retIdx(Cnt);
    int Cnt1=0;
    for (int i=0;i<x.size();i++) {
       if (y[i]) {
          retIdx[Cnt1]=i+1;
          Cnt1++;
       }
    }
    return retIdx;
}

then I get

retIdxNA(t1)
[1] 3 4 8

I was wondering:

(i) is there any equivalent of which in Rcpp?

(ii) is there any way to make the above function shorter/crisper? In particular, is there any easy way to combine the Step 1, 2, 3 above?

uday
  • 6,453
  • 13
  • 56
  • 94

4 Answers4

15

Recent version of RcppArmadillo have functions to identify the indices of finite and non-finite values.

So this code

#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]

// [[Rcpp::export]]
arma::uvec whichNA(arma::vec x) {
  return arma::find_nonfinite(x);
}

/*** R
t1 <- c(1,2,NA,NA,3,4,1,NA,5)
whichNA(t1)
*/

yields your desired answer (module the off-by-one in C/C++ as they are zero-based):

R> sourceCpp("/tmp/uday.cpp")

R> t1 <- c(1,2,NA,NA,3,4,1,NA,5)

R> whichNA(t1)
     [,1]
[1,]    2
[2,]    3
[3,]    7
R> 

Rcpp can do it too if you first create the sequence to subset into:

// [[Rcpp::export]]
Rcpp::IntegerVector which2(Rcpp::NumericVector x) {
  Rcpp::IntegerVector v = Rcpp::seq(0, x.size()-1);
  return v[Rcpp::is_na(x)];
}

Added to code above it yields:

R> which2(t1)
[1] 2 3 7
R> 

The logical subsetting is also somewhat new in Rcpp.

Dirk Eddelbuettel
  • 360,940
  • 56
  • 644
  • 725
7

Try this:

#include <Rcpp.h> 
using namespace Rcpp; 

// [[Rcpp::export]]
IntegerVector which4( NumericVector x) {

    int nx = x.size();
    std::vector<int> y;
    y.reserve(nx);

    for(int i = 0; i < nx; i++) {
        if (R_IsNA(x[i])) y.push_back(i+1);
    }

    return wrap(y);
}

which we can run like this in R:

> which4(t1)
[1] 3 4 8

Performance

Note that we have changed the above solution to reserve space for the output vector. This replaces which3 which is:

// [[Rcpp::export]]
IntegerVector which3( NumericVector x) {
    int nx = x.size();
    IntegerVector y;
    for(int i = 0; i < nx; i++) {
        // if (internal::Rcpp_IsNA(x[i])) y.push_back(i+1);
        if (R_IsNA(x[i])) y.push_back(i+1);
    }
    return y;
}

Then the performance on a vector 9 elements long is the following with which4 the fastest:

> library(rbenchmark)
> benchmark(retIdxNA(t1), whichNA(t1), which2(t1), which3(t1), which4(t1), 
+    replications = 10000, order = "relative")[1:4]
          test replications elapsed relative
5   which4(t1)        10000    0.14    1.000
4   which3(t1)        10000    0.16    1.143
1 retIdxNA(t1)        10000    0.17    1.214
2  whichNA(t1)        10000    0.17    1.214
3   which2(t1)        10000    0.25    1.786

Repeating this for a vector 9000 elements long the Armadillo solution comes in quite a bit faster than the others. Here which3 (which is the same as which4 except it does not reserve space for the output vector) comes in worst while which4 comes second.

> tt <- rep(t1, 1000)
> benchmark(retIdxNA(tt), whichNA(tt), which2(tt), which3(tt), which4(tt), 
+   replications = 1000, order = "relative")[1:4]
          test replications elapsed relative
2  whichNA(tt)         1000    0.09    1.000
5   which4(tt)         1000    0.79    8.778
3   which2(tt)         1000    1.03   11.444
1 retIdxNA(tt)         1000    1.19   13.222
4   which3(tt)         1000   23.58  262.000
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
  • Sure, but why not use the non-loop two-liner I posted? Add a +1 offset at will to the sequence if you think that matters. – Dirk Eddelbuettel May 25 '14 at 01:23
  • For compactness and readability one of your solutions would be better but if speed is an issue then I would suspect this solution is faster as it seems its often the case that Rcpp runs faster with devectorized code. Also its closer to the poster's solution. Note that I did give you one of your upvotes. – G. Grothendieck May 25 '14 at 02:27
  • How about if you measured it? – Dirk Eddelbuettel May 25 '14 at 03:01
  • OK. Added them. `which4` wins on the basis of speed for small vectors and `whichNA` for large. – G. Grothendieck May 25 '14 at 03:42
  • @Dirk, By the way, originally I had omitted `wrap` in `which4` but that failed to compile. Why does it not convert the return value automatically without `wrap`? Also would it be feasible to add a `reserve` method to `NumericVector` as it seems to make a huge difference on long vectors? – G. Grothendieck May 25 '14 at 03:45
  • This has all been discussed before. `std::vector<>` != `SEXP`-based R vectors, so `wrap()` is needed and `reserve()` would likely involve inefficient copying. And nice to see that the Armadillo version looks pretty good on vectors that matter. Personally, I would not worry too much about `n=9`. – Dirk Eddelbuettel May 25 '14 at 03:52
  • If the reservation of space could be specified at the time that the vector is created that would not involve copying. – G. Grothendieck May 25 '14 at 12:35
  • We have `Rcpp::NumericVector(int n)` just like standard STL vectors, and it is fairly common to use the size when known. Is that what you are looking for? – Dirk Eddelbuettel May 25 '14 at 13:00
  • I was looking for something that allocated the space but did not make the vector that large so that expanding the vector up to that size will not have a significant impact. `data.table` has something like that. An R data.table has a length (no of cols) and a truelength (no of cols that it can expand to without significant impact). std::vector in C++ has this with reserve. – G. Grothendieck May 25 '14 at 13:18
  • Actually I believe that R vectors have a true length too in addition to the actual length although the true length cannot be accessed from R. – G. Grothendieck May 25 '14 at 13:24
  • @G.Grothendieck the "true length" thing does not really do what the name implies. It is mainly used in R with hash tables but has nothing to do with some sort of maximum capacity as the name implies. – Romain Francois May 25 '14 at 13:32
4

All of the solutions above are serial. Although not trivial, it is quite possible to take advantage of threading for implementing which. See this write up for more details. Although for such small sizes, it would not more harm than good. Like taking a plane for a small distance, you lose too much time at airport security..

R implements which by allocating memory for a logical vector as large as the input, does a single pass to store the indices in this memory, then copy it eventually into a proper logical vector.

Intuitively this seems less efficient than a double pass loop, but not necessarily, as copying a data range is cheap. See more details here.

Romain Francois
  • 17,432
  • 3
  • 51
  • 77
  • There somewhere. https://github.com/romainfrancois/Rcpp-book/tree/master/chapters/threading/code The code might need a refresh as it is 4 months old. Note however that it needs the `` header which AFAIK is not available on current Rtools windows toolset .. – Romain Francois May 25 '14 at 13:44
  • This all very nice, and you may enjoy the [keynote Luke gave at our conference](http://www.rinfinance.com/agenda/2014/talk/LukeTierney.pdf) as there will be similar approaches based on OpenMP in R "at some point", but also entire orthogonal to what the OP asked for: a solution for Rcpp (not Rcpp11) that can be used today, – Dirk Eddelbuettel May 25 '14 at 14:30
  • You can use Rcpp11 today. – Romain Francois May 25 '14 at 14:55
  • Just how you can use C++11 in Rcpp today too. But you still haven't given the OP an answer he can use. – Dirk Eddelbuettel May 25 '14 at 15:04
  • @DirkEddelbuettel, what is the main difference between Rcpp and Rcpp11? – uday May 26 '14 at 22:21
  • @uday the main difference is who pulls the strings and as a consequence how each evolve. Rcpp11 is a complete rewrite. Only keeping the best parts, and focused on modern C++, and innovating. OTOH, Rcpp is stable and compatible with older C++ standard. – Romain Francois May 27 '14 at 00:09
-4

Just write a function for yourself like:

which_1<-function(a,b){
return(which(a>b))
}

Then pass this function into rcpp.

www139
  • 4,960
  • 3
  • 31
  • 56
guagua
  • 1