First, please make examples reproducible. The above lacks Rcpp Attributes, C++11 plugin, and necessary header imports.
Second, the issue that is being displayed here is the cost of performing a deep copy of the data out of R to a C++ structure. The majority of time in your benchmark is being spent in the copy process. This process is triggered by the choice of using an std::vector<std::string>
instead of an Rcpp::CharacterVector
, which holds the SEXP
, s-expression, or a pointer to the data. By negating the proxy model offered by Rcpp objects, which only performs shallow copies, there will be an immediate cost to data importation into C++ that is much larger than mere microseconds described within Why is this simplistic cpp function version slower?.
Having said that, let's talk how to modify the above example to use Rcpp objects. First, note that Rcpp objects have a member function called .sort()
that accurately sorts an Rcpp::CharacterVector
with missing values (see Rcpp FAQ Section 5.5: Sorting with STL on a CharacterVector produces problematic results for details this assumes no capitalize or special locale). Secondly, the SEXP
type can be used as a way to construct the std::unordered_set
even with the data being imported as an Rcpp::CharacterVector
. These modifications can be found in the C++ functions with "native" in their declaration.
#include <Rcpp.h>
#include <unordered_set>
#include <algorithm>
// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::export]]
int unique_sort(std::vector<std::string> x) {
sort(x.begin(), x.end());
return unique(x.begin(), x.end()) - x.begin();
}
// [[Rcpp::export]]
int unique_set(std::vector<std::string> x) {
std::unordered_set<std::string> tab(x.begin(), x.end());
return tab.size();
}
// [[Rcpp::export]]
int unique_sort_native(Rcpp::CharacterVector x) {
x.sort();
return std::unique(x.begin(), x.end()) - x.begin();
}
// [[Rcpp::export]]
int unique_set_native(Rcpp::CharacterVector x) {
std::unordered_set<SEXP> tab(x.begin(), x.end());
return tab.size();
}
Test Code:
# install.packages(c("microbenchmark"))
# Note, it is more efficient to supply an integer rather than a vector
# in sample()'s first parameter.
x <- paste0("x", sample(1e5, 1e7, replace=T))
# Run a microbenchmark
microbenchmark::microbenchmark(
length(unique(x)),
length(unique.default(x)),
unique_sort(x),
unique_set(x),
unique_sort_native(x),
unique_set_native(x),
times = 10
)
Output:
Unit: milliseconds
expr min lq mean median uq max neval
length(unique(x)) 208.0 235.3 235.7 237.2 240.2 247.4 10
length(unique.default(x)) 230.9 232.8 238.8 233.7 241.8 266.6 10
unique_sort(x) 12759.4 12877.1 12993.8 12920.1 13043.2 13416.7 10
unique_set(x) 2528.1 2545.3 2590.1 2590.3 2631.3 2670.1 10
unique_sort_native(x) 7452.6 7482.4 7568.5 7509.0 7563.6 7917.8 10
unique_set_native(x) 175.8 176.9 179.2 178.3 182.3 183.4 10
So, when avoiding a deep copy by using Rcpp objects, the unique_set_native
function beats out the length(unique())
call by around 30 milliseconds.