Find the largest n unique values and their frequencies in R and Rcpp

Question

I have a numeric vector v (with already omitted NAs) and want to get the nth largest values and their respective frequencies.

I found http://gallery.rcpp.org/articles/top-elements-from-vectors-using-priority-queue/ to be quite fast.

// [[Rcpp::export]]
std::vector<int> top_i_pq(NumericVector v, unsigned int n)
{

typedef pair<double, int> Elt;
priority_queue< Elt, vector<Elt>, greater<Elt> > pq;
vector<int> result;

for (int i = 0; i != v.size(); ++i) {
    if (pq.size() < n)
      pq.push(Elt(v[i], i));
    else {
      Elt elt = Elt(v[i], i);
      if (pq.top() < elt) {
        pq.pop();
        pq.push(elt);
      }
    }
  }

  result.reserve(pq.size());
  while (!pq.empty()) {
    result.push_back(pq.top().second + 1);
    pq.pop();
  }

  return result ;

}

However ties will not be respected. In fact I don't need the indices, returning the values would also be ok.

What I would like to get is a list containing the values and the frequencies, say something like:

numv <- c(4.2, 4.2, 4.5, 0.1, 4.4, 2.0, 0.9, 4.4, 3.3, 2.4, 0.1)

top_i_pq(numv, 3)
$lengths
[1] 2 2 1

$values
[1] 4.2 4.4 4.5

Neither getting a unique vector, nor a table, nor a (full) sort is a good idea, as n is usually small compared to the length of v (which might easily be >1e6).

Solutions so far are:

 library(microbenchmark)
 library(data.table)
 library(DescTools)

 set.seed(1789)
 x <- sample(round(rnorm(1000), 3), 1e5, replace = TRUE)
 n <- 5

 microbenchmark(
   BaseR = tail(table(x), n),
   data.table = data.table(x)[, .N, keyby = x][(.N - n + 1):.N],
   DescTools = Large(x, n, unique=TRUE),
   Coatless = ...
 )

Unit: milliseconds
       expr       min         lq       mean     median        uq       max neval
      BaseR 188.09662 190.830975 193.189422 192.306297 194.02815 253.72304   100
 data.table  11.23986  11.553478  12.294456  11.768114  12.25475  15.68544   100
  DescTools   4.01374   4.174854   5.796414   4.410935   6.70704  64.79134   100

Hmm, DescTools still fastest, but I'm sure it can be significantly improved by Rcpp (as it's pure R)!

nrussell · Accepted Answer · 2016-05-03T22:55:51.853

I'd like to throw my hat in the ring with another Rcpp-based solution, which is ~7x faster than the DescTools approach and ~13x faster than the data.table approach, using the 1e5-length x and n = 5 sample data above. The implementation is a bit lengthy, so I'll lead with the benchmark:

fn.dt <- function(v, n) {
    data.table(v = v)[
      ,.N, keyby = v
      ][(.N - n + 1):.N]
}

microbenchmark(
    "DescTools" = Large(x, n, unique=TRUE),
    "top_n" = top_n(x, 5),
    "data.table" = fn.dt(x, n),
    times = 500L
)
# Unit: microseconds
#        expr      min       lq      mean   median       uq       max neval
#   DescTools 3330.527 3790.035 4832.7819 4070.573 5323.155 54921.615   500
#       top_n  566.207  587.590  633.3096  593.577  640.832  3568.299   500
#  data.table 6920.636 7380.786 8072.2733 7764.601 8585.472 14443.401   500

Update

If your compiler supports C++11, you can take advantage of std::priority_queue::emplace for a (surprisingly) noticeable performance boost (compared to the C++98 version below). I won't post this version as it is mostly identical, save for a few calls to std::move and emplace, but here's a link to it.

Testing this against the previous three functions, and using data.table 1.9.7 (which is a bit faster than 1.9.6) yields

print(res2, order = "median", signif = 3)
# Unit: relative
#              expr  min    lq      mean median    uq   max neval  cld
#            top_n2  1.0  1.00  1.000000   1.00  1.00  1.00  1000    a   
#             top_n  1.6  1.58  1.666523   1.58  1.75  2.75  1000    b  
#         DescTools 10.4 10.10  8.512887   9.68  7.19 12.30  1000    c 
#  data.table-1.9.7 16.9 16.80 14.164139  15.50 10.50 43.70  1000    d

where top_n2 is the C++11 version.

The top_n function is implemented as follows:

#include <Rcpp.h>
#include <utility>
#include <queue>

class histogram {
private:
    struct paired {
        typedef std::pair<double, unsigned int> pair_t;

        pair_t pair;
        unsigned int is_set;

        paired() 
            : pair(pair_t()),
              is_set(0)
        {}

        paired(double x)
            : pair(std::make_pair(x, 1)),
              is_set(1)
        {}

        bool operator==(const paired& other) const {
            return pair.first == other.pair.first;
        }

        bool operator==(double other) const {
            return is_set && (pair.first == other);
        }

        bool operator>(double other) const {
            return is_set && (pair.first > other);
        }

        bool operator<(double other) const {
            return is_set && (pair.first < other);
        }

        paired& operator++() {
            ++pair.second;
            return *this;
        }

        paired operator++(int) {
            paired tmp(*this);
            ++(*this);
            return tmp;
        }
    };

    struct greater {
        bool operator()(const paired& lhs, const paired& rhs) const {
            if (!lhs.is_set) return false;
            if (!rhs.is_set) return true;
            return lhs.pair.first > rhs.pair.first;
        }
    };  

    typedef std::priority_queue<
        paired,
        std::vector<paired>,
        greater
    > queue_t;

    unsigned int sz;
    queue_t queue;

    void insert(double x) {
        if (queue.empty()) {
            queue.push(paired(x));
            return;
        }

        if (queue.top() > x && queue.size() >= sz) return;

        queue_t qtmp;
        bool matched = false;

        while (queue.size()) {
            paired elem = queue.top();
            if (elem == x) {
                qtmp.push(++elem);
                matched = true;
            } else {
                qtmp.push(elem);
            }
            queue.pop();
        }

        if (!matched) {
            if (qtmp.size() >= sz) qtmp.pop();
            qtmp.push(paired(x));
        }

        std::swap(queue, qtmp);
    }

public:
    histogram(unsigned int sz_) 
        : sz(sz_), 
          queue(queue_t())
    {}

    template <typename InputIt>
    void insert(InputIt first, InputIt last) {
        for ( ; first != last; ++first) {
            insert(*first);
        }
    }

    Rcpp::List get() const {
        Rcpp::NumericVector values(sz);
        Rcpp::IntegerVector freq(sz);
        R_xlen_t i = 0;

        queue_t tmp(queue);
        while (tmp.size()) {
            values[i] = tmp.top().pair.first;
            freq[i] = tmp.top().pair.second;
            ++i;
            tmp.pop();
        }

        return Rcpp::List::create(
            Rcpp::Named("value") = values,
            Rcpp::Named("frequency") = freq);
    }
};


// [[Rcpp::export]]
Rcpp::List top_n(Rcpp::NumericVector x, int n = 5) {
    histogram h(n);
    h.insert(x.begin(), x.end());
    return h.get();
}

There's a lot going on in the histogram class above, but just to touch on some of the key points:

The paired type is essentially a wrapper class around an std::pair<double, unsigned int>, which associates a value with a count, providing some convenience features such as operator++() / operator++(int) for direct pre-/post-increment of the count, and modified comparison operators.
The histogram class wraps a sort of "managed" priority queue, in the sense that the size of std::priority_queue is capped at a particular value sz.
Instead of using the default std::less ordering of std::priority_queue, I'm using a greater-than comparator so that candidate values can be checked against std::priority_queue::top() to quickly determine whether they should (a) be discarded, (b) replace the current minimum value in the queue, or (c) update the count of one of the existing values in the queue. This is only possible because the size of the queue is being restricted to <= sz.

Nice! @AndriSignorell this is probably the solution you want. — coatless, May 03 '16 at 17:35
This is indeed impressive! Thanks very much for your creative solution! One point, the C++11 solution gives me these errors: top_n2.cpp: In member function 'void histogram::insert(double)': top_n2.cpp:71:13: error: 'histogram::queue_t' has no member named 'emplace' queue.emplace(x); ^ top_n2.cpp:83:14: error: 'histogram::queue_t' has no member named 'emplace' qtmp.emplace(std::move(++elem)); ^ Any idea what could be missing? (I have newest R-Devel and newest Rtools) — Andri Signorell, May 04 '16 at 05:57
@AndriSignorell not quite, but I do recognize good work and what nrussel did is very, very good. Unfortunately, I've just been a bit more busy than I like. If you are on Windows you need to change the path line in Rtools to use the new compiler. — coatless, May 04 '16 at 06:11
Hmm, I have the new compiler: c:/Rtools/mingw_64/bin/g++ -I"C:/PROGRA~1/R/R-devel/include" -DNDEBUG -I"C:/Program Files/R/R-devel/library/Rcpp/include" -I"d:/Compiler/gcc-4.9.3/local330/include" -O2 -Wall -mtune=core2 -c ntop2_n.cpp -o ntop2_n.o — Andri Signorell, May 04 '16 at 07:05
Did you catch the very top line in that file, `// [[Rcpp::plugins(cpp11)]]`? — nrussell, May 04 '16 at 11:56
I'm guessing not, since your compiler output would have included "-std=c++11". The attribute `// [[Rcpp::plugins(cpp11)]]` is needed in source files utilizing C++11 features to enable to correct compiler flags. — nrussell, May 04 '16 at 15:25
You're right, I missed that. But it's still not compiling: top_n.cpp: In member function 'void histogram::insert(double)': top_n.cpp:71:13: error: 'histogram::queue_t' has no member named 'emplace' queue.emplace(x); ^ Any further idea? — Andri Signorell, May 04 '16 at 23:24
Besides, I'd like to integrate that in DescTools. I would cite you as Nathan Russell with a link to this stackoverflow post, correct so? — Andri Signorell, May 04 '16 at 23:26
Re: citation - that should be fine; I'm not sure what the "correct" way to do this is (if there even is one), but that is certainly fine by me. As for the compilation error, that's a little puzzling because `emplace` has been implemented in g++ since ~4.8.0 or so. Could you compile & run [this code](https://gist.github.com/nathan-russell/da64f0e4b4a74b0d2620b15f3ee11c90#file-cpp-version-cpp) and reply with the number it outputs? I don't have a Windows machine available at the moment, but I will try to test this on one tomorrow morning and see if I get the same error. — nrussell, May 05 '16 at 00:15
Yes it looks like your compiler is still using the C++98 specification. — nrussell, May 08 '16 at 20:21
Ok, I got that running and integrated. Last question: What has to be changed, when the smallest 5 elements should be returned instead the top 5? — Andri Signorell, May 12 '16 at 06:55
[Here's a variation](https://gist.github.com/nathan-russell/d41699e49b5c2a627d2fb6a7867a6bd3#file-bottom_n-cpp11-cpp) that does that; all that is needed is to change the [comparator used by the queue](https://gist.github.com/nathan-russell/d41699e49b5c2a627d2fb6a7867a6bd3#file-bottom_n-cpp11-cpp-L52-L64) and the comparison used to [assess candidate values](https://gist.github.com/nathan-russell/d41699e49b5c2a627d2fb6a7867a6bd3#file-bottom_n-cpp11-cpp-L75). — nrussell, May 12 '16 at 13:44
We've almost reached the end! One thing I noticed just now: If size is smaller than k, the function returns zeros up to the number of k. For example Large(c(3,4,4,4), k=5) would return $value [1] 3 4 0 0 0 $frequency [1] 1 3 0 0 0 Could this be handled within the function? — Andri Signorell, May 13 '16 at 09:18
Yes sorry -- I programmed this logic into the `paired` class and forgot to take advantage of it. You can replace `histogram::get` in both cases (`top_n` & `bottom_n`) [with this version](https://gist.github.com/nathan-russell/980a766f3676a5c6e6a670b199794683#file-histogram-get-revised-cpp). — nrussell, May 13 '16 at 12:55
Great, with that I think we have it done so far. I'll publish a DescTools update within the next few days containing the new function. Thanks very much for your work! — Andri Signorell, May 13 '16 at 14:06
Happy to help. You may also want to consider hosting your package on GitHub at some point in the future so that it is easier to collaborate with others; just a suggestion though. — nrussell, May 13 '16 at 14:22

MichaelChirico · Answer 2 · 2016-05-03T01:48:24.107

4

I'd wager data.table is competitive:

library(data.table)

data <- data.table(v)

data[ , .N, keyby = v][(.N - n + 1):.N]

where n is the number you want to get

edited May 03 '16 at 01:48

answered May 03 '16 at 01:09

MichaelChirico

33,841
14
113
198

1

Thanks for your answer! You're solution is indeed relatively competitive, in fact the fastest I saw so far. – Andri Signorell May 03 '16 at 06:03
This is indeed the fastest so far. – Kunal Puri May 03 '16 at 07:06
@AndriSignorell are you running the development version of `data.table`? If not, it should be even faster in devel since `[ , .N, keyby]` has recently been optimized a bit. Install with `install.packages("data.table", type = "source", repos = "https://Rdatatable.github.io/data.table")` – MichaelChirico May 03 '16 at 18:17
Hmm, see nrussell's benchmark, where data.table still is 16 times slower, than his (best) solution. But still faster than base R... – Andri Signorell May 04 '16 at 06:25
no worries! there's still lots of overhead in R so a true optimized solution must be faster than `data.table`, and probably considerably so... the advantage of this solution is still its simplicity – MichaelChirico May 04 '16 at 12:16
Point well made, still fastest solution without using a function! ;-) – Andri Signorell May 04 '16 at 23:10
@AndriSignorell and what's better, there are a few feature requests out there that should speed up the `data.table` solution, e.g., [#1415](https://github.com/Rdatatable/data.table/issues/1415) and [#1244](https://github.com/Rdatatable/data.table/issues/1244) (since I imagine a lot of the hang-up is due to the `data.table(v = v)` step...) – MichaelChirico May 04 '16 at 23:16

coatless · Answer 3 · 2016-05-04T17:57:05.167

Note: The previous version replicated functionality for table() and not the target. This version has been removed and will be available off-site.

Map plan of attack

Below is a solution using a map.

C++98

First of all, we need to find the "unique" values for the vector of numbers.

To do this, we opt to store the number being counted as a key within a std::map and increment the value each time we observe that number.

Using the ordering structure of the std::map, we know that the top n numbers are at the back of the std::map. Thus, we use an iterator to pop those elements and export them in an array.

C++11

If one has access to a C++11 compiler, an alternative is to use std::unordered_map, which has a Big O of O(1) for insertion and retrieval ( O(n) if bad hashes) vs. std::map which has a Big O of O(log(n)).

To obtain the correct top n, one would then use std::partial_sort() to do so.

The Implementation

C++98

#include <Rcpp.h>

// [[Rcpp::export]]
Rcpp::List top_n_map(const Rcpp::NumericVector & v, int n)
{

  // Initialize a map
  std::map<double, int> Elt;

  Elt.clear();

  // Count each element
  for (int i = 0; i != v.size(); ++i) {
    Elt[ v[i] ] += 1;
  }

  // Find out how many unique elements exist... 
  int n_obs = Elt.size();

  // If the top number, n, is greater than the number of observations,
  // then drop it.  
  if(n > n_obs ) { n = n_obs; }

  // Pop the last n elements as they are already sorted. 

  // Make an iterator to access map info
  std::map<double,int>::iterator itb = Elt.end();

  // Advance the end of the iterator up to 5.
  std::advance(itb, -n);

  // Recast for R
  Rcpp::NumericVector result_vals(n);

  Rcpp::NumericVector result_keys(n);

  unsigned int count = 0;

  // Start at the nth element and move to the last element in the map.
  for( std::map<double,int>::iterator it = itb; it != Elt.end(); ++it )
  {
    // Move them into split vectors
    result_keys(count) = it->first;
    result_vals(count) = it->second;

    count++;
  }

  return Rcpp::List::create(Rcpp::Named("lengths") = result_vals,
                            Rcpp::Named("values") = result_keys);
}

Short Test

Let's verify that it works by running over some data:

# Set seed for reproducibility
set.seed(1789)
x <- sample(round(rnorm(1000), 3), 1e5, replace = TRUE)
n <- 5

And now we seek to obtain the occurrence information:

# Call our function
top_n_map(a)

Gives us:

$lengths
[1] 101 104 101 103 103

$values
[1] 2.468 2.638 2.819 3.099 3.509

Benchmarks

Unit: microseconds
       expr        min          lq        mean      median         uq        max neval
      BaseR 112750.403 115946.7175 119493.4501 117676.2840 120712.595 166067.530   100
 data.table   6583.851   6994.3665   8311.8631   7260.9385   7972.548  47482.559   100
  DescTools   3291.626   3503.5620   5047.5074   3885.4090   5057.666  43597.451   100
   Coatless   6097.237   6240.1295   6421.1313   6365.7605   6528.315   7543.271   100
nrussel_c98    513.932    540.6495    571.5362    560.0115    584.628    797.315   100
nrussel_c11    489.616    512.2810    549.6581    533.2950    553.107    961.221   100

As we can see, this implementation beats out data.table, but falls victim to DescTools and @nrussel's attempts.

Thanks for the promising suggestion. However your microbenchmark (14'000) is ~ 3 times slower than the data.table solution (5'000) below whereas the RCpp-Gallery example has a value of ~700. As far as I understand your solution you're in principle tabling the whole vector and afterwards sorting the full vector. This is done relatively fast, but it's expensive and a waste of ressources, as we need only the nth (typically n=5) largest values. How would you extend your function by the number of needed values: top_i_pq(a, n)? — Andri Signorell, May 03 '16 at 06:01
@AndriSignorell missed that on first glance. Sorry about that. Yes, I do have a way of dropping it down a bit. Let me write it up tomorrow at / after lunch. — coatless, May 03 '16 at 06:35
@AndriSignorell by the way, could you provide a test data set size / generation conditions (e.g. `a = sample(seq(1, 10, length.out = 1000), 1e6))` ) — coatless, May 03 '16 at 08:11
Combining the priority queue approach (from Rcpp gallery) with your map structure might be a promising concept for winning the race ... But how? — Andri Signorell, May 03 '16 at 12:22
Nice try and definitely less code than nrussell. But concerning Performance we have a winner: nrussell! congrats! Thanks to all contributers! — Andri Signorell, May 04 '16 at 23:14

Find the largest n unique values and their frequencies in R and Rcpp

3 Answers3

Map plan of attack

C++98

C++11

The Implementation

C++98

Short Test

Benchmarks