
I'm working on an R package dealing with 3D meshes. I have a function which constructs a parametric mesh, but for certain parameterizations it can generate duplicated vertices (numeric vectors of length 3). I need to find the duplicated vertices in order to merge them into a single vertex. I implemented an algorithm with Rcpp to do that.

There's a function for that in the Rvcg package: vcgClean(mymesh, sel = 0). It is very fast, and compared to it, my C++ algorithm is very slow. Similarly, the base R function duplicated is much faster.

Here is how my algorithm works. Say there are n vertices.

  • I take vertex 1, I compare it to vertices 2, 3, ..., n.

  • I take vertex 2, I compare it to vertices 3, 4, ..., n. Note that I could skip this comparison if this vertex was marked as a duplicate of vertex 1 at the previous step, but I don't do that. Anyway, there's usually only a small number of duplicates, so this is not the cause of the slowness.

  • And so on: vertex 3 is compared to vertices 4, 5, ..., n, etc.

To compare two vertices, I test the near-equality of each coordinate:

bool test = nearEqual(v1(0), v2(0)) && nearEqual(v1(1), v2(1)) && nearEqual(v1(2), v2(2));

Before that, I tested exact equality, and it was slow as well. My nearEqual function is:

bool nearEqual(double l, double r) {
  // true iff r equals l or is the next representable double from l toward r
  return r == std::nextafter(l, r);
}

Why is my algorithm so slow? Is it due to the (n-1) + (n-2) + ... + 1 comparison strategy, i.e. O(n²) comparisons? I don't see how I could do fewer comparisons. Or is it due to the nearEqual function? Or something else? The vertices are stored in a 3 x n numeric matrix in Rcpp.
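
For reference, the strategy above amounts to something like this minimal, self-contained sketch (`pairwise_duplicated` is an illustrative name, not the actual package function; it marks the later occurrence of each duplicated column):

#include <Rcpp.h>
#include <cmath>

// true iff r equals l or is the next representable double from l toward r
bool nearEqual(double l, double r) {
    return r == std::nextafter(l, r);
}

// [[Rcpp::export]]
Rcpp::LogicalVector pairwise_duplicated(Rcpp::NumericMatrix Vertices) {
    int n = Vertices.ncol();
    Rcpp::LogicalVector dup(n);
    for (int i = 0; i < n - 1; i++) {
        for (int j = i + 1; j < n; j++) {   // vertex i vs. vertices i+1, ..., n
            if (nearEqual(Vertices(0, i), Vertices(0, j)) &&
                nearEqual(Vertices(1, i), Vertices(1, j)) &&
                nearEqual(Vertices(2, i), Vertices(2, j))) {
                dup[j] = true;              // mark the later column as a duplicate
            }
        }
    }
    return dup;
}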


Edit

I've just tried another way to test the near-equality of two columns, but this is still slow:

const Rcpp::NumericMatrix::Column& v1 = Vertices.column(i);
const Rcpp::NumericMatrix::Column& v2 = Vertices.column(j);
bool test = Rcpp::max(Rcpp::abs(v1 - v2)) < 1e-16;
  • Does this answer your question? [What's the most efficient way to erase duplicates and sort a vector?](https://stackoverflow.com/questions/1041620/whats-the-most-efficient-way-to-erase-duplicates-and-sort-a-vector) – JRR Jun 10 '23 at 09:15
  • @JRR No, this is a different problem. – Stéphane Laurent Jun 10 '23 at 09:30
  • The `collapse` package has a function `fduplicated()` which may be worth looking at. – NicChr Jun 10 '23 at 09:45
  • Your algorithm has a complexity of `O(n^2)`, but you should be able to get to `O(n * log(n))` by sorting the vertices on one coordinate. You could probably adapt the algorithm for finding the closest pair of points (the 2D version of the approach is described here: https://www.cs.ubc.ca/~liorma/cpsc320/files/closest-points.pdf). An alternative approach for narrowing down the number of points to compare: create a map from `(n1, n2, n3)` to a list of points, where `n1, n2, n3` are integers such that `x in [n1 * 1e-16, (n1+1) * 1e-16)`, etc., which narrows the lists to check down to 27 keys (see the sketch after these comments)... – fabian Jun 10 '23 at 10:19
  • As noted, sorting is critical to reduce the number of comparisons. Putting the points in a kd-tree and then looking for nearest neighbors of each point is often used and would keep you at n log n complexity. – mattking Jun 10 '23 at 16:27
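
To make the grid idea from the comments concrete, here is a rough sketch (the function name `grid_duplicates`, the key type `Cell`, and the `eps` tolerance are illustrative choices, not code from the question): each coordinate is snapped to a cell of width `eps`, and each vertex is compared only against earlier vertices in the 27 surrounding cells, which avoids the quadratic scan for typical meshes.

#include <Rcpp.h>
#include <cmath>
#include <map>
#include <tuple>
#include <vector>

// integer grid cell index for a vertex
typedef std::tuple<long long, long long, long long> Cell;

// [[Rcpp::export]]
Rcpp::IntegerVector grid_duplicates(Rcpp::NumericMatrix V, double eps = 1e-12) {
    std::map<Cell, std::vector<int>> grid;    // cell -> 0-based column indices
    Rcpp::IntegerVector dupOf(V.ncol(), 0);   // 0 = unique, else 1-based index of the original
    for (int i = 0; i < V.ncol(); ++i) {
        long long cx = (long long)std::floor(V(0, i) / eps);
        long long cy = (long long)std::floor(V(1, i) / eps);
        long long cz = (long long)std::floor(V(2, i) / eps);
        bool found = false;
        // a near-equal vertex can only live in one of the 27 neighbouring cells
        for (long long dx = -1; dx <= 1 && !found; ++dx)
        for (long long dy = -1; dy <= 1 && !found; ++dy)
        for (long long dz = -1; dz <= 1 && !found; ++dz) {
            auto it = grid.find(Cell(cx + dx, cy + dy, cz + dz));
            if (it == grid.end()) continue;
            for (int j : it->second) {
                if (std::abs(V(0, i) - V(0, j)) < eps &&
                    std::abs(V(1, i) - V(1, j)) < eps &&
                    std::abs(V(2, i) - V(2, j)) < eps) {
                    dupOf[i] = j + 1;   // duplicate of vertex j (1-based)
                    found = true;
                    break;
                }
            }
        }
        if (!found) grid[Cell(cx, cy, cz)].push_back(i);
    }
    return dupOf;
}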

2 Answers


You could keep track of discovered vertex positions in a map and duplicated vertices in a set, so that only one pass is needed:

#include <Rcpp/Lightest>
#include <map>
#include <set>
#include <tuple>

typedef std::tuple<double, double, double> Point3;

// [[Rcpp::export()]]
Rcpp::List duplicated_vertices(Rcpp::NumericMatrix x) {
    // map each distinct vertex to the (1-based) columns where it occurs
    std::map<Point3, Rcpp::IntegerVector> positions;
    // vertices that have been seen more than once
    std::set<Point3> duplicates;

    for (R_len_t i = 0; i < x.ncol(); ++i) {
        Point3 vertex = { x(0, i), x(1, i), x(2, i) };
        if (positions.count(vertex) == 0) {
            Rcpp::IntegerVector position = { i + 1 };
            positions.emplace(vertex, position);
        } else {
            duplicates.insert(vertex);
            positions.at(vertex).push_back(i + 1);
        }
    }

    Rcpp::List result;
    for (const Point3& vertex: duplicates) {
        result.push_back(positions.at(vertex));
    }

    return result;
}

On my machine I get a roughly 10-fold speedup compared to duplicated():

> set.seed(123)
> N <- 150
> M <- matrix(sample(c(1, 2, 3), N * 3, replace = TRUE), 3, N)
> 
> microbenchmark::microbenchmark(
+   Rcpp = dupes(M),
+   R = duplicated(t(M), fromLast = TRUE),
+   cpp_map = duplicated_vertices(M)
+ )
Unit: microseconds
    expr   min     lq    mean median    uq    max neval
    Rcpp 458.7 468.50 490.049  476.6 487.5 1092.7   100
       R 329.2 339.45 408.056  347.6 361.7 5103.3   100
 cpp_map  45.1  47.60  59.972   48.9  52.0  880.7   100
  • Terrific. Thanks. I'm just wondering how to adapt this code to test almost-equality instead of exact equality. – Stéphane Laurent Jun 10 '23 at 19:54
  • Maybe simply by rounding the vertices... – Stéphane Laurent Jun 10 '23 at 20:03
  • @StéphaneLaurent yeah rounding is the only thing I could think of too. – Mikko Marttila Jun 10 '23 at 20:04
  • Very nice! My only comment is that I generally recommend against `push_back` on R data objects (the final `Rcpp::List`) as they are _not_ STL objects. Preallocating and assigning by position is probably better (see the sketch after these comments). – Dirk Eddelbuettel Jun 10 '23 at 21:12
  • Also note that the speedup will vary with the percentage of dupes. By keeping my data generation (only sampling from three values) but pushing it to N = 150, you guarantee a lot of dupes, which will likely make this look better :) With the smaller data in my example it is still twice as fast as the Rcpp approach. Still very good, but not as 'golden' as shown here. It will all depend on Stéphane's data. – Dirk Eddelbuettel Jun 10 '23 at 21:22
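
A sketch of the rounding idea discussed in the comments above (`quantize` and its `factor` parameter are illustrative, not part of the answer): snap each coordinate to a fixed precision before building the map key, so that vertices differing only by tiny numerical noise produce the same key.

#include <cmath>

// round x to a fixed precision; factor = 1e9 keeps about 9 decimal digits
// (an assumption -- choose a factor that matches your tolerance)
inline double quantize(double x, double factor = 1e9) {
    return std::round(x * factor) / factor;
}

// in duplicated_vertices(), the key would then be built from quantized values:
// Point3 vertex = { quantize(x(0, i)), quantize(x(1, i)), quantize(x(2, i)) };

One caveat: two nearly-equal values can still land on opposite sides of a rounding boundary, so this is not a strict near-equality test. And per the `push_back` comment, the final list can be preallocated instead of grown (again a sketch, reusing the answer's names):

Rcpp::List result(duplicates.size());
R_len_t k = 0;
for (const Point3& vertex : duplicates) {
    result[k++] = positions.at(vertex);
}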

If your vertices are always of length three, you can take advantage of that. I find that a really naive implementation (where the only trick I use is turning off the preservation of the RNG state) seems to beat R's own duplicated(). Note that I also use only one function called from R here, to minimize the calling overhead.

Code

#include <Rcpp/Lightest>

inline bool dupevec(Rcpp::NumericVector x, Rcpp::NumericVector y) {
    // exact equality of two length-3 vertices, coordinate by coordinate
    return x[0] == y[0] && x[1] == y[1] && x[2] == y[2];
}

// [[Rcpp::export(rng=false)]]
Rcpp::LogicalVector dupes(Rcpp::NumericMatrix M) {
    int n = M.ncol();
    Rcpp::LogicalVector lv(n);
    for (int i=0; i<n; i++) {
        bool val = false;
        Rcpp::NumericVector x = M.column(i);
        // compare column i to all later columns, stopping at the first match
        for (int j=i+1; !val && j<n; j++) {
            val = dupevec(x, M.column(j));
        }
        lv[i] = val;
    }
    return lv;
}

/*** R
set.seed(123)
N <- 15
M <- matrix(sample(c(1,2,3), N*3, replace=TRUE), 3, N)
M
dupes(M)
duplicated(t(M), fromLast=TRUE)
microbenchmark::microbenchmark(Rcpp=dupes(M), R=duplicated(t(M), fromLast=TRUE))
*/

Output

> Rcpp::sourceCpp("~/git/stackoverflow/76445480/answer.cpp")

> set.seed(123)

> N <- 15

> M <- matrix(sample(c(1,2,3), N*3, replace=TRUE), 3, N)

> M
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
[1,]    3    2    2    1    1    1    1    1    3     2     1     1
[2,]    3    3    2    2    2    3    1    3    2     3     3     3
[3,]    3    2    3    2    3    3    1    2    1     2     3     2
     [,13] [,14] [,15]
[1,]     1     1     3
[2,]     3     2     1
[3,]     1     3     3

> dupes(M)
 [1] FALSE  TRUE FALSE FALSE  TRUE  TRUE FALSE  TRUE FALSE FALSE FALSE
[12] FALSE FALSE FALSE FALSE

> duplicated(t(M), fromLast=TRUE)
 [1] FALSE  TRUE FALSE FALSE  TRUE  TRUE FALSE  TRUE FALSE FALSE FALSE
[12] FALSE FALSE FALSE FALSE

> microbenchmark::microbenchmark(Rcpp=dupes(M), R=duplicated(t(M), fromLast=TRUE))
Unit: microseconds
 expr    min      lq     mean  median     uq      max neval cld
 Rcpp 19.454 24.2775 102.3388 26.0460 27.587 7470.863   100   a
    R 65.982 73.3015  87.6992 77.1715 85.216  811.735   100   a
> 
  • Thanks! I'm currently busy with something else; I will look more carefully later. But at first glance, I'm under the impression that something is missing: I don't only want to know whether a column is duplicated, I also need to get the indices of the duplicates when the column is duplicated. I'm afraid I was not clear enough in my post. See you later. – Stéphane Laurent Jun 10 '23 at 14:49