Efficient intersection of sets?

Question

I'm wondering what the most efficient way of doing this is.

I have points that I gather from 2 places.

I am only interested in the points which are common to both places.

My plan is to have 3 std::set<Point> . First I will add in the points from area A,into set A then points from B into set B and let set C be the intersection of both sets.

However, I'm wondering if there is a better way of doing this that involves maybe less sets?

Thanks

Is there a geometrical description of the overlap? – Mad Physicist Jan 07 '14 at 19:20 — Mad Physicist, Jan 07 '14 at 19:20

score 4 · Accepted Answer · answered Jan 07 '14 at 19:23

4

Your problem is so common that there even is a (named in an obvious way) standard algorithm set_intersection() for you to use.

answered Jan 07 '14 at 19:23

Griwes

8,805
2
43
70

Isn't this the solution OP was trying to improve upon? – mpartel Jan 07 '14 at 19:24
@mpartel, if there was a space for improvement, the standard library would surely include it. The standard algorithms are there for you to use them. If they were bad, no-one would use them, but actually the recommendation of the C++ community is to always prefer a standard algorithm over a hand written loop or own version of the algorithm (even suffering from terminal case of NIH is not a reason to avoid them). – Griwes Jan 07 '14 at 19:25
5

Certainly, they are very good implementations for the problems they do solve, but they don't directly solve all variations of every problem optimally. e.g. 'Useless' gave a very plausible improvement that exploited some reasonable assumptions that may well hold in the OP's case. (Of course all the usual warnings about premature optimization and benchmarking are still relevant.) – mpartel Jan 08 '14 at 10:12

score 1 · Answer 2 · answered Jan 07 '14 at 19:20

1

You can do away with set B: first gather points from A into set A, then gather points from B, but place them into C only if they are also present in A.

If sets A and B are of different (and predictable) sizes, the obvious choice would be to eliminate the larger one.

answered Jan 07 '14 at 19:20

user4815162342

141,790
18
296
355

What is set's contains function? – jmasterx Jan 07 '14 at 19:21
@Milo The same as the one you're using now for sets A, B, and C. – user4815162342 Jan 07 '14 at 19:22
@Milo This answer tries to answer the question as posed (as do the other ones). If you don't understand how to use sets, you should ask a separate question. – user4815162342 Jan 07 '14 at 19:24

Useless · Answer 3 · 2014-01-08T10:35:28.707

The naive set-based approach (built two sets, and then generate the intersection), has the following steps:

build std::set A from source 1. Say source 1 has N points, this is:
- O(N log N) time, O(N) space
build std::set B from source 2. Say this has M points, giving:
- O(M log M) time, O(M) space
std::set_intersection
- O(M+N) time and space

The slightly improved set-based approach is:

build std::set A from the first source (same complexity as above)
for every point in the second source, add it to the result if (and only if) it's in set A
- O(M log N) time, linear space
- so, you avoided O(M) extra space allocated, and the extra O(M+N) intersection step.
You'd implement this using std::set like so:
```
if (a.find(point) != a.end())
    result.insert(point);
```

Note that if you know which point source is going to provide fewer points, you should use that source to build set A for best performance. If your sources provide points in sorted order, you can avoid the sets entirely and save even more space & time.

score 0 · Answer 4 · answered Jan 07 '14 at 19:22

0

You can have a map<Point, int> where each point is mapped to the number of times it appears, i.e. 1 or 2. Then delete all points that appeared only once.

answered Jan 07 '14 at 19:22

mpartel

4,474
1
24
31

score 0 · Answer 5 · answered Jan 07 '14 at 19:23

0

You can use map<Point,int> to store the points occuring 1 or 2 times. You will also have to overload the < operator in that case.

answered Jan 07 '14 at 19:23

HelloWorld123456789

5,299
3
23
33

score 0 · Answer 6 · answered Jan 07 '14 at 19:23

I think approach with three sets and std::set_intersection is the most logical and efficient. Although if you really want to have less sets you can just iterate through set B and remove points which are not in A then iterate through set A and remove points which are not in B. This should do.

japreiss · Answer 7 · 2014-01-07T19:35:10.020

If you define a lexicographical operator< on points, i.e. (extrapolate for >2 dimensions if needed)

bool operator<(Point const &a, Point const &b)
{
    if (a.x != b.x) return a.x < b.x;
    return a.y < b.y;
}

Then it's possible that std::set will have good enough performance to compute the intersection efficiently. Assuming your two groups are already in arrays, or generated on the fly, you need 2 sets:

A = { all points from A }
C = { }
for all points in B:
    if point is in A:
        add point to C

If you're using floating-point coordinates, then you probably need to allow for floating-point errors and search for all points in A near each point in B. Then you want a spatial partition structure such as a k-d tree or quadtree/octree,. That would involve bringing in a new library or coding up a fairly complex data structure, so I would avoid it if possible.

Efficient intersection of sets?

7 Answers7

Linked