The usual caveats of not much experience with C++ apply. I need to calculate the equivalent of hist(x, breaks=breaks, plot=FALSE)$counts in Rcpp.
I've written the following Rcpp function to calculate frequencies:
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector get_freq(NumericVector x, NumericVector breaks) {
  int nbreaks = breaks.size();
  NumericVector out(nbreaks - 1);
  for (int i = 0; i < nbreaks - 1; i++) {
    // Full pass over x for every bin (breaks[i], breaks[i+1]]
    LogicalVector temp = (x > breaks(i)) & (x <= breaks(i + 1));
    out[i] = sum(temp);
  }
  return out;
}
The function is called multiple times by another Rcpp function.
The problem is that the run time increases linearly with the length of x:
breaks <- seq(from=0, to=max(x)+1, length.out=101)
library(microbenchmark)
microbenchmark(get_freq(runif(100, 1, 100), breaks),
               get_freq(runif(1000, 1, 100), breaks),
               get_freq(runif(3000, 1, 100), breaks))
Unit: microseconds
                                  expr      min       lq      mean   median       uq      max neval cld
  get_freq(runif(100, 1, 100), breaks)  176.420  184.611  190.1675  188.415  191.633  313.927   100 a
 get_freq(runif(1000, 1, 100), breaks) 1700.119 1714.309 1807.4252 1732.302 1809.687 5564.958   100  b
 get_freq(runif(3000, 1, 100), breaks) 5134.003 5157.701 5342.2800 5177.157 5434.180 9242.844   100   c
get_freq is called multiple times with x typically of length 3000+, and causes a bottleneck in the Rcpp code that is otherwise much faster than the R equivalent.
Any suggestions for ways to improve the speed of get_freq?
Update
After posting this question I realized I should be searching for 'C++ histogram' instead of 'C++ frequency'. I found this answer which I thought did the job. Unfortunately it doesn't.
I need the frequency function to return a vector of fixed length (i.e. nbreaks - 1, one count per bin) as above. The linked answer doesn't do this: it only returns counts of observed values.
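One way to get a fixed-length result while touching each element of x only once is sketched below. It is written against plain std::vector so that it compiles without Rcpp; the function name count_in_bins is hypothetical, and it assumes breaks is sorted in increasing order. It keeps the same (breaks[i], breaks[i+1]] intervals as get_freq, so a value equal to the first break is not counted. Each value is located with a binary search, so the cost is roughly length(x) * log(nbreaks) comparisons instead of one full pass over x per bin.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of Rcpp): counts how many values of x fall
// in each interval (breaks[i], breaks[i+1]].
// Assumption: breaks is sorted in increasing order.
std::vector<int> count_in_bins(const std::vector<double>& x,
                               const std::vector<double>& breaks) {
  if (breaks.size() < 2) return {};

  // One slot per bin, so empty bins are reported as 0 rather than dropped.
  std::vector<int> counts(breaks.size() - 1, 0);

  for (double v : x) {
    // First break that is >= v; v belongs to the bin that ends at that break.
    auto it = std::lower_bound(breaks.begin(), breaks.end(), v);

    // Skip values at or below the first break, or above the last break.
    if (it == breaks.begin() || it == breaks.end()) continue;

    std::size_t bin = static_cast<std::size_t>(it - breaks.begin()) - 1;
    ++counts[bin];
  }
  return counts;
}
```

The same loop should carry over to NumericVector arguments inside an exported Rcpp function, since NumericVector also provides begin() and end(), though that variant is untested here. If the breaks are known to be equally spaced, the bin index could be computed arithmetically instead of searched for.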