I have a numeric vector v (with already omitted NAs) and want to get the nth largest values and their respective frequencies.
I found http://gallery.rcpp.org/articles/top-elements-from-vectors-using-priority-queue/ to be quite fast.
// [[Rcpp::export]]
std::vector<int> top_i_pq(NumericVector v, unsigned int n)
{
typedef pair<double, int> Elt;
priority_queue< Elt, vector<Elt>, greater<Elt> > pq;
vector<int> result;
for (int i = 0; i != v.size(); ++i) {
if (pq.size() < n)
pq.push(Elt(v[i], i));
else {
Elt elt = Elt(v[i], i);
if (pq.top() < elt) {
pq.pop();
pq.push(elt);
}
}
}
result.reserve(pq.size());
while (!pq.empty()) {
result.push_back(pq.top().second + 1);
pq.pop();
}
return result ;
}
However ties will not be respected. In fact I don't need the indices, returning the values would also be ok.
What I would like to get is a list containing the values and the frequencies, say something like:
numv <- c(4.2, 4.2, 4.5, 0.1, 4.4, 2.0, 0.9, 4.4, 3.3, 2.4, 0.1)
top_i_pq(numv, 3)
$lengths
[1] 2 2 1
$values
[1] 4.2 4.4 4.5
Neither getting a unique vector, nor a table, nor a (full) sort is a good idea, as n is usually small compared to the length of v (which might easily be >1e6).
Solutions so far are:
library(microbenchmark)
library(data.table)
library(DescTools)
set.seed(1789)
x <- sample(round(rnorm(1000), 3), 1e5, replace = TRUE)
n <- 5
microbenchmark(
BaseR = tail(table(x), n),
data.table = data.table(x)[, .N, keyby = x][(.N - n + 1):.N],
DescTools = Large(x, n, unique=TRUE),
Coatless = ...
)
Unit: milliseconds
expr min lq mean median uq max neval
BaseR 188.09662 190.830975 193.189422 192.306297 194.02815 253.72304 100
data.table 11.23986 11.553478 12.294456 11.768114 12.25475 15.68544 100
DescTools 4.01374 4.174854 5.796414 4.410935 6.70704 64.79134 100
Hmm, DescTools still fastest, but I'm sure it can be significantly improved by Rcpp (as it's pure R)!