2

I'm trying to write a function to count the number of unique items in a string vector (my problem is slightly more complex, but this is reproducible. I did this based on answers I found for C++. Here is my code:

C++

int unique_sort(vector<string> x) {
    sort(x.begin(), x.end());
    return unique(x.begin(), x.end()) - x.begin();
}

int unique_set(vector<string> x) {
    unordered_set<string> tab(x.begin(), x.end());
    return tab.size();
}

R:

x <- paste0("x", sample(1:1e5, 1e7, replace=T))
microbenchmark(length(unique(x)),unique_sort(x), unique_set(x), times=3)

Results:

Unit: milliseconds
              expr        min         lq       mean     median         uq
 length(unique(x))   365.0213   373.4018   406.0209   381.7823   426.5206
    unique_sort(x) 10732.1918 10847.0532 10907.6882 10961.9146 10995.4363
     unique_set(x)  1948.6517  2230.3383  2334.4040  2512.0249  2527.2802

Looking at the R source code for the unique function (it is a bit hard to follow), it seems to use a loop over the array to add unique elements to a hash and checks that hash if element already exist.

Therefore, I thought it should be equivalent to the unordered_set approach. I don't understand why the unordered_set approach is 5 times slower.

TLDR: why is my C++ code slow?

thc
  • 9,527
  • 1
  • 24
  • 39
  • What is "highly optimized compiled code" and how does one achieve that? Rcpp is already compiled. Also other functions can be written to performance parity with compiled R code (e.g., the tabulate function). I feel that I am missing something algorithmic to be 5 times slower. – thc Jul 11 '17 at 03:12
  • Dirk explains it much better [here](https://stackoverflow.com/q/39757374/4408538). I think this is a similar situation. – Joseph Wood Jul 11 '17 at 03:20
  • In that post, the difference due to overhead is measured in nanoseconds, not particularly relevant here where the difference is seconds. I can show you with other R functions where the difference is much smaller than seconds. – thc Jul 11 '17 at 03:24
  • I know this doesn't answer your question as to why, but here is an `Rcpp` function that is about 5x faster than base R: `int unique_size(CharacterVector x) {return Rcpp::unique(x).size();}`. I keep looking into the source code and can't really find anything. It's interesting indeed. – Joseph Wood Jul 11 '17 at 03:51
  • Thanks, I appreciate it. Hopefully someone can answer the "why" question :) – thc Jul 11 '17 at 03:54

1 Answers1

10

First, please make examples reproducible. The above lacks Rcpp Attributes, C++11 plugin, and necessary header imports.

Second, the issue that is being displayed here is the cost of performing a deep copy of the data out of R to a C++ structure. The majority of time in your benchmark is being spent in the copy process. This process is triggered by the choice of using an std::vector<std::string> instead of an Rcpp::CharacterVector, which holds the SEXP, s-expression, or a pointer to the data. By negating the proxy model offered by Rcpp objects, which only performs shallow copies, there will be an immediate cost to data importation into C++ that is much larger than mere microseconds described within Why is this simplistic cpp function version slower?.

Having said that, let's talk how to modify the above example to use Rcpp objects. First, note that Rcpp objects have a member function called .sort() that accurately sorts an Rcpp::CharacterVector with missing values (see Rcpp FAQ Section 5.5: Sorting with STL on a CharacterVector produces problematic results for details this assumes no capitalize or special locale). Secondly, the SEXP type can be used as a way to construct the std::unordered_set even with the data being imported as an Rcpp::CharacterVector. These modifications can be found in the C++ functions with "native" in their declaration.

#include <Rcpp.h>
#include <unordered_set>
#include <algorithm>

// [[Rcpp::plugins(cpp11)]]

// [[Rcpp::export]]
int unique_sort(std::vector<std::string> x) {
  sort(x.begin(), x.end());
  return unique(x.begin(), x.end()) - x.begin();
}

// [[Rcpp::export]]
int unique_set(std::vector<std::string> x) {
  std::unordered_set<std::string> tab(x.begin(), x.end());
  return tab.size();
}

// [[Rcpp::export]]
int unique_sort_native(Rcpp::CharacterVector x) {
  x.sort();
  return std::unique(x.begin(), x.end()) - x.begin();
}

// [[Rcpp::export]]
int unique_set_native(Rcpp::CharacterVector x) {
  std::unordered_set<SEXP> tab(x.begin(), x.end());
  return tab.size();
}

Test Code:

# install.packages(c("microbenchmark"))

# Note, it is more efficient to supply an integer rather than a vector
# in sample()'s first parameter.
x <- paste0("x", sample(1e5, 1e7, replace=T))

# Run a microbenchmark
microbenchmark::microbenchmark(
  length(unique(x)),
  length(unique.default(x)),
  unique_sort(x),
  unique_set(x),
  unique_sort_native(x),
  unique_set_native(x),
  times = 10
)

Output:

Unit: milliseconds
                      expr     min      lq    mean  median      uq     max neval
         length(unique(x))   208.0   235.3   235.7   237.2   240.2   247.4    10
 length(unique.default(x))   230.9   232.8   238.8   233.7   241.8   266.6    10
            unique_sort(x) 12759.4 12877.1 12993.8 12920.1 13043.2 13416.7    10
             unique_set(x)  2528.1  2545.3  2590.1  2590.3  2631.3  2670.1    10
     unique_sort_native(x)  7452.6  7482.4  7568.5  7509.0  7563.6  7917.8    10
      unique_set_native(x)   175.8   176.9   179.2   178.3   182.3   183.4    10

So, when avoiding a deep copy by using Rcpp objects, the unique_set_native function beats out the length(unique()) call by around 30 milliseconds.

coatless
  • 20,011
  • 13
  • 69
  • 84
  • Can you add a base R approach that calls the S3 methods directly instead of the S3 generics to the benchmarks? It would be interesting if Rcpp is still faster if you avoid the method dispatch (I expect it isn't). – Roland Jul 11 '17 at 08:25
  • @Roland, looks like removing the dispatch didn't do much in this case. I upped the amount of evaluation attempts to further investigate... Potentially, the additional time difference could be attributed to the `length()` call in _R_? – coatless Jul 11 '17 at 09:11
  • length is also a generic. But it doesn't seem like that fully explains the difference. Mmh. – Roland Jul 11 '17 at 09:22
  • Nice answer, once again! – Dirk Eddelbuettel Jul 11 '17 at 10:35
  • Thank you, I really appreciate your explanation and code examples! – thc Jul 11 '17 at 19:45