
I have sets of pairs of ints, like set<pair<int,int> > x1, x2, ..., xn (n can be between 2 and 20). What is the fastest way to find the union of those sets?

Sorry if I wasn't clear at the beginning: I meant fast in performance; memory allocation is not a problem.

Damir
    Add all elements of all sets into one? – nhahtdh Jul 06 '12 at 12:17
  • Can you please define _fast_? If you mean performance, please provide more background: Requirements, size of elements, number of elements, number of sets, target machine. – Sebastian Mach Jul 06 '12 at 13:31
  • @phresnel: I can tell you the size of the elements, it's `sizeof(pair<int,int>)` ;-p I'm guessing 8 bytes. Number of sets is 2-20. The rest isn't in the question. – Steve Jessop Jul 06 '12 at 13:56
  • @Steve: The question isn't really clear. If it is about runtime performance, then, among many other things, the total number of elements is highly relevant. And if they are not known, then the fact that they are not known or have some bounds (or not) is again highly relevant. Big-O measurements are really not everything, especially not for specific cases. But if it is about fast-to-implement, the question can be answered mildly :D – Sebastian Mach Jul 06 '12 at 14:10
  • @Steve: Oops, I realise my own error wrt size of elements :D And I also misread "The rest isn't in the question" as "The rest isn't the question". Begging your pardon remorsefully. – Sebastian Mach Jul 06 '12 at 14:32
  • I entirely agree with your point, it's not possible for any answer to say what will be fastest for the user's data, since we don't know what the user's data is. Same goes for the user's hardware and C++ implementation. – Steve Jessop Jul 06 '12 at 14:51

7 Answers


Assuming that the result needs to be a set too, then you have no choice but to insert every element of each x_i into that result set. So the obvious implementation is:

set<pair<int,int>> x(x1);
x.insert(x2.begin(), x2.end());
// etc

The remaining question is whether this can be beaten for speed.

The single-element insert takes a position hint, which if correct speeds up insertion. So it might turn out that something like this is faster than x.insert(x2.begin(), x2.end());:

auto pos = x.begin();
for (auto it = x2.begin(); it != x2.end(); ++it) {
    pos = x.insert(pos, *it);
}

It depends on the data, though: that position may or may not be accurate. You can ensure that it is by putting all the elements in order before you start, for which the best tool is probably set_union. That might better be named merge_and_dedupe_sorted_ranges, because what it does has nothing particularly to do with std::set. You could either set_union into intermediate vectors, or else into sets like this:

set<pair<int,int>> x;
set_union(x1.begin(), x1.end(), x2.begin(), x2.end(), inserter(x, x.end()));

My concern with using set_union is that in order to get the benefit of adding the elements to a set in increasing order, you need to create a new empty container each time you call it (because if it's not empty then the elements added need to interleave with the values already in it). The overhead of these containers might be higher than the overhead of inserting into a set in arbitrary order: you would have to test it.

Steve Jessop
  • The question specifies that the input data are already `set`s. – ecatmur Jul 06 '12 at 12:26
  • @ecatmur: I know. It doesn't specify the output format, though, which is why I made an assumption about it. – Steve Jessop Jul 06 '12 at 12:28
  • I meant that because the input data are `set`s, you know that `x1` and `x2` are already sorted ranges. – ecatmur Jul 06 '12 at 14:09
  • Good point about the position hint; I've asked http://stackoverflow.com/questions/11364320/complexity-of-inserting-sorted-range-into-associative-container about this issue. – ecatmur Jul 06 '12 at 14:38

Unfortunately, I believe that you are limited to a linear O(N) solution, as a union is simply a combination of the elements in both sets.

template<typename S>
S union_sets(const S& s1, const S& s2)
{
     S result = s1;

     result.insert(s2.cbegin(), s2.cend());

     return result;
}
Richard J. Ross III
  • 1
    I don't really think that this option is the best, because insert takes O(log n) time per insert which is much more than the `2*(count1+count2)-1` of `set_union`. The first solution from @phresnel should be faster. – Till Dec 06 '12 at 14:25
  • @Till: The `inserter` of phresnel's solution uses the `set::insert` method, too. So it has the same big-O complexity as this solution here - which by the way isn't linear, as stated here, but `O(N*log(N))`. However, phresnel's solution *can* be linear, if you let the `inserter` insert the result in a `list` instead of a `set`. But I'm not sure, in how far that complies with the original question. – mastov Jun 24 '15 at 11:02
  • This is not correct, as it will introduce duplicates. A union consists of unique, non-repeated entries that are either in A, B, or both. – Aleksandr Hovhannisyan Jul 28 '17 at 00:51
  • 1
    @AleksandrH s1 and s2 are both sets already, which makes type S a set. The resulting set will automatically remove any duplicates during insertion. – Richard J. Ross III Jul 28 '17 at 02:47
  • @RichardJ.RossIII Hmm, but unless I'm mistaken, insert doesn't delete duplicates. For example, suppose s1 = {1, 2, 3} and s2 = {2, 3}. Then the union should be {1, 2, 3}. – Aleksandr Hovhannisyan Jul 28 '17 at 11:25
  • Aha, interesting. So insert gets rid of duplicates? – Aleksandr Hovhannisyan Jul 28 '17 at 20:41
  • @AleksandrH no, the *set* itself always gets rid of duplicates. By definition a set has at most one of an item. – Richard J. Ross III Jul 28 '17 at 22:22
6

Find the union of the smallest sets first. That is, order your sets by size, compute the union of the two smallest sets, delete those sets, and insert the union back into your list of sets according to its size.

If you had a measure of how similar two sets are likely to be, then your best bet would be to find the union of the most similar sets first. That is, prefer union operations that eliminate duplicates early.

Edit: And for each union operation between two sets - merge the smaller set into the bigger set.

Rafael Baptista

I assume with fast you mean fast to implement.

Then: std::set_union (*)

Example for two sets:

#include <set>
#include <algorithm>
#include <iterator>
using namespace std;

int main () {
    set<pair<int,int> > a, b, uni;
    set_union (a.begin(), a.end(),
               b.begin(), b.end(),
               inserter(uni, uni.begin()));

}

for n sets, hand writing it might be the most maintainable solution:

#include <set>
#include <vector>
using namespace std;

int main () {
    vector<set<pair<int,int>>> sets;
    set<pair<int,int>> uni;

    for (const auto &s : sets)
        for (const auto &elem : s)
            uni.insert (elem);
}

though in general, one should prefer standard algorithms and profit from their quality implementation.

If by fast you mean performance, we can't help as we don't have the requirements. Different approaches might give different results for different circumstances.


(*) note: the site is sometimes frowned upon for not being 100% accurate with respect to the standard

Sebastian Mach

Try std::set_union from the <algorithm> header.

Anon Mail

To save on memory allocations and improve locality, it'd be better to use a single vector<T> as working memory.

Construct a vector<T> and reserve the total number of elements in all of the sets (counting duplicates). Then, starting with the empty range [v.begin(), v.begin()), extend it to a set-like (unique, sorted) range by appending the contents of each set, merging and uniquifying:

vector<T> v;
v.reserve(<total size>);
for (set<T> &s: sets) {
    auto middle = v.insert(v.end(), s.begin(), s.end());
    inplace_merge(v.begin(), middle, v.end());
    v.erase(unique(v.begin(), v.end()), v.end());
}
ecatmur
  • Since the questioner is interested in speed, wouldn't an n-way de-duping merge likely be faster than n in-place 2-way merges followed by dedupes? In your current code the contents of the first set get moved around in the vector 2*(n-1) times, which seems wasteful. Of course it's way more code, since there's no standard algorithm for an n-way merge. – Steve Jessop Jul 06 '12 at 13:04
  • Probably, yes; I suppose you'd want to keep a list of iterators sorted by their indirected value. It'd be interesting to compare performance. – ecatmur Jul 06 '12 at 13:31

You could use std::set_union recursively, or simply insert all sets into a result set (duplicate items are eliminated by the set). If the number of items is very small, you can try inserting everything into a vector, sorting it, and using std::unique on the vector.

MadScientist