20

I have two unordered_set and want the intersection of those. I can't find a library function to do that.

Essentially, what I want is this:

unordered_set<int> a = {1, 2, 3};
unordered_set<int> b = {2, 4, 1};

unordered_set<int> c = a.intersect(b); // Should be {1, 2}

I can do something like

unordered_set<int> c;
for (int element : a) {
  if (b.count(element) > 0) {
    c.insert(element);
  }
}

However, I think there should be a more convenient way to do that. If there isn't one, can someone explain why? I know there is std::set_intersection, but that seems to operate on vectors only.

Jan Schultke
  • 17,446
  • 6
  • 47
  • 96
radschapur
  • 445
  • 1
  • 3
  • 13
  • I sure don't understand why you think set_intersection is limited to vectors only, it takes two ranges as input iterators and an output iterator. Just about any standard container should be able to satisfy those requirements. – SoronelHaetir Jan 08 '18 at 22:21
  • Doing the simple loop approach is O(n) with an `unordered_set`. You sould use find instead of count though. @SoronelHaetir set_intersection needs a sorted set. – super Jan 08 '18 at 22:24

3 Answers3

21

In fact, a loop-based solutions is the best thing you can use with std::unordered_set.

There is an algorithm called std::set_intersection which allows to find an intersection of two sorted ranges:

Constructs a sorted range beginning at d_first consisting of elements that are found in both sorted ranges [first1, last1) and [first2, last2).

As you deal with std::unordered_set, you cannot apply this algorithm because there is no guaranteed order for the elements in std::unordered_set.

My advice is to stick with loops as it explicitly says what you want to achieve and has a linear complexity (O(N), where N is a number of elements in the unordered set you traverse with a for loop) which is the best compexity you might achieve.

Edgar Rokjān
  • 17,245
  • 4
  • 40
  • 67
  • So the complexity is `O(min(len(s), len(l)))` where s and l are the sets being intersected. Obviously you want to make `N=min(len(s), len(l))` and not `max` as almost all libraries do such as in Python's `set` object. Just clarifying since sometimes these details are forgotten. – Gregory Morse Nov 30 '19 at 06:14
  • @GregoryMorse: where did you get that info about Python `set.__and__` being `max` of operands `len`gths? [obviously](https://github.com/python/cpython/blob/fb5db7ec58624cab0797b4050735be865d380823/Objects/setobject.c#L1184-L1188) it's not true – Azat Ibrakov Oct 27 '20 at 13:58
  • This answer lacks the important detail already pointed out by @GregoryMorse : "the unordered set you traverse with a for loop" should be the smaller of the two unordered sets. – Don Hatch Nov 24 '20 at 03:02
0

As the accepted answer explains, there is nothing that directly does what you want, and std::set_intersection isn't intended for intersection std::unordered_sets, even though it sounds like it. The ideal solution depends on the C++ standard you're using.

Given two sets std::unordered_set<T> a, b; ...

In-place

... the best method to turn a into the intersection a & b is as follows:

// C++20 (needs <unordered_set>)
std::erase_if(a, [&b](int e) {
    return !b.contains(e);
});

// C++11 (needs <unordered_set>)
for (auto it = a.begin(); it != a.end();) {
    if (!b.count(*a)) { it = a.erase(it); }
    else              { ++it; }
}

Non-modifying

Alternatively, to compute a new set c which contains the intersection of a and b:

// C++23 (needs <unordered_set>, <ranges>)
std::unordered_set<int> c(std::from_range, a | std::views::filter([&b](int e) {
    return b.contains(e);
});

// C++20 (needs <unordered_set>, <ranges>)
auto view = a | std::views::filter([&b](int e) {
    return b.contains(e);
};
std::unordered_set<int> c(view.begin(), view.end());

// C++11 (needs <unordered_set>)
std::unordered_set<int> c;
// TODO: consider reserving a size (optimistically |a| + |b|)
for (int e : a) {
    if (b.count(e)) { a.insert(e); }
}
Jan Schultke
  • 17,446
  • 6
  • 47
  • 96
-5

There is a function from std called set_intersection. However, it would have a very high complexity using it with std::set as input parameter.. A better solution is, create two vectors from those sets and use set_intersection with vectors as input parameters.

Onur A.
  • 3,007
  • 3
  • 22
  • 37
  • 2
    doesn't `set_intersection` require a sorted input? – Stephan Lechner Jan 08 '18 at 22:27
  • *"However, it would have a very high complexity using it with std::set as input parameter."* - Would you mind expanding on that? – Holt Jan 08 '18 at 22:27
  • If the values are already in std::unordered_set there is no more efficient approach then the simple loop. Your suggestion is just more complicated and worse performance. – super Jan 08 '18 at 22:31
  • 1
    @super Moving the value to vectors, sorting the vector, performing the intersection might be faster than the two above loops. You'd have to benchmark. – Holt Jan 08 '18 at 22:35
  • 1
    @Holt Not really. Using a loop + find gives O(n) time. Using 2 sorted vectors would also give O(n) time. But if you have to make the vectors AND sort them there is no way for that method to be faster. Please keep in mind that std::unordered_set has O(n) lookup unless the set is gigantic. – super Jan 08 '18 at 22:39
  • 3
    @super You are talking about theoretical asymptotic complexities. Vectors are known to perform faster than most containers for most operations, even faster at performing tasks dedicated to containers such as sorted insertion. You cannot assume that this approach will be slower unless you actually benchmark it. See e.g. https://lemire.me/blog/2017/01/27/how-expensive-are-the-union-and-intersection-of-two-unordered_set-in-c/. – Holt Jan 08 '18 at 22:41
  • @Holt Sure, in principle you are right about the fact that you need to benchmark something to really know. But here we are talking about comparing walking and running. In principle you would need to test... but in practice. – super Jan 08 '18 at 23:35
  • @super Not really walking vs running. I just did some quick tests with clang, -O3, and sets with 300000 elements and both approaches have the same running time (within 2%). I used random data, who knows what could happens using data with specific distribution... – Holt Jan 08 '18 at 23:37
  • @Holt Surely you are not including creating + sorting the vectors in the running time? – super Jan 08 '18 at 23:39
  • @super You should have a look at a Bjarne Stroustrup's talk (I think it was him, not 100% sure, maybe Herb Sutter... ) at cppcon (can't remember which) where he spoke about vector performance vs other standard containers. – Holt Jan 08 '18 at 23:42
  • @super Of course I am including creating and sorting the vector. – Holt Jan 08 '18 at 23:42
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/162773/discussion-between-super-and-holt). – super Jan 08 '18 at 23:46
  • @super I would be happy to continue this discussion but it's 1 a.m here so I'll pass for tonight. But I took the time to reboot my computer to upload this benchmark gist if you want to try it out yourself: https://gist.github.com/anonymous/bee65aa443ce6e35ae82fc3e5e89cca1 I get better performance with the non-vector version, but as you will notice, it is not as obvious as it would seem. I actually compiled with g++ (TDM on Windows), I am so used to use clang at work that I wrote clang in my previous comment... – Holt Jan 08 '18 at 23:51
  • [I made one too](https://pastebin.com/7WKLpXbv) The loop seems to end up somewhere around 55-65% of the intersect time on my machine with both our tests. – super Jan 09 '18 at 00:06
  • @super how many elements did you test with? Complexity differences are of course sensitive to dataset size. Complexity can only trump all other considerations when you know your data can be of infinite size. – Mark Ransom Jan 09 '18 at 00:16
  • @MarkRansom 10000, 100000 and 1000000. – super Jan 09 '18 at 00:28
  • @Holt If I modify your benchmark to save the intersection results in a vector instead of a set, the loop version is almost 5x faster. – super Jan 09 '18 at 01:19
  • @super true, but if you convert it back to an unordered_set, the loop version is only twice as fast as the vector version. Even if the loop version is 5 times faster than the vector one, this is not what you'd expect with theoretical complexity (log2(1e6) ~ 20) so you'd expect the loop version to be 20 times faster (due to the sort complexity), not counting memory allocation. Anothing thing to mention is that with the vector version, the result is already sorted, so you could use it directly without having to sort it. [...] – Holt Jan 09 '18 at 09:26
  • @super [...] My point is that you cannot simply look at the theoretical complexities of the operations on the containers to choose the best one, you have to consider how the container is used in the algorithm. If you are not looking for performance but for simplicity, then `set` and `unordered_set` might be great, but when I seek performance, I often try to do as much as possible with vectors, because vectors are much faster due to their memory layout and much easier to optimize for the compiler. [...] – Holt Jan 09 '18 at 09:29
  • @super [...] Even if the loop version end up winning here, I hope that by now you get my original point! BTW, I get better result with vector using your version, 100000 elements and `-O3` on clang (strange, isn't it?). Also, I'd like to point out that our benchmarks are quite poors since the two original sets tends to have an empty intersection, which is quite a specific here. – Holt Jan 09 '18 at 09:31