9

Here is my code; any ideas how to make it faster? My implementation is brute force: for each element in a, check whether it is also in b, and if so, put it in the result set c. Any smarter ideas are appreciated.

#include <cstdio>          // std::printf (not guaranteed to be declared by <iostream>)
#include <unordered_set>

int main() {
    std::unordered_set<int> a = {1, 2, 3, 4, 5};
    std::unordered_set<int> b = {3, 4, 5, 6, 7};
    std::unordered_set<int> c;
    // Brute force: for each element of a, check whether b contains it.
    for (auto i = a.begin(); i != a.end(); ++i) {
        if (b.find(*i) != b.end()) c.insert(*i);
    }
    for (int v : c) {
        std::printf("%d\n", v);
    }
}
Lin Ma

4 Answers

10

Asymptotically, your algorithm is as good as it can get.

In practice, I'd add a check to loop over the smaller of the two sets and do lookups in the larger one. Assuming reasonably evenly distributed hashes, a lookup in a std::unordered_set takes constant time, so this way you'll be performing fewer such lookups.
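
A minimal sketch of that size check, reusing the names a, b, c from the question (the helper function itself is illustrative, not from the answer):

#include <unordered_set>

// Iterate over the smaller set and probe the larger one, so the number
// of hash lookups is min(|a|, |b|) rather than always |a|.
std::unordered_set<int> intersect(const std::unordered_set<int>& a,
                                  const std::unordered_set<int>& b) {
    const auto& smaller = a.size() <= b.size() ? a : b;
    const auto& larger  = a.size() <= b.size() ? b : a;
    std::unordered_set<int> c;
    for (int v : smaller) {
        if (larger.count(v)) c.insert(v);
    }
    return c;
}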

Angew is no longer proud of SO
5

You can do it with std::copy_if()

std::copy_if(a.begin(), a.end(), std::inserter(c, c.begin()),
             [&b](int element) { return b.count(element) > 0; });

Capturing b by reference ([&b]) avoids copying the entire set into the lambda.
stephane k.
3

Your algorithm is as good as it gets for an unordered set. However, if you use a std::set (which uses a binary tree as storage) or, even better, a sorted std::vector, you can do better. The algorithm should be something like:

  1. Get iterators to a.begin() and b.begin().
  2. If the iterators point to equal elements, add the element to the intersection and increment both iterators.
  3. Otherwise, increment the iterator pointing to the smaller value.
  4. Go to 2.

Both should run in O(n + m) time, and either saves you from calculating hashes and from any performance degradation that arises from hash collisions.
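
A minimal sketch of this merge on two sorted std::vectors (the names are illustrative, not from the answer); note that the standard library ships exactly this algorithm as std::set_intersection in <algorithm>:

#include <algorithm>
#include <iterator>
#include <vector>

// Two-pointer merge intersection over sorted ranges, as described above.
std::vector<int> intersect_sorted(const std::vector<int>& a,
                                  const std::vector<int>& b) {
    std::vector<int> c;
    auto ia = a.begin();
    auto ib = b.begin();
    while (ia != a.end() && ib != b.end()) {
        if (*ia < *ib) {
            ++ia;                   // advance the side with the smaller value
        } else if (*ib < *ia) {
            ++ib;
        } else {
            c.push_back(*ia);       // equal: element is in both
            ++ia;
            ++ib;
        }
    }
    return c;
}

// Equivalent with the standard algorithm:
//   std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
//                         std::back_inserter(c));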

doron
  • Why do you think a set is faster than an unordered set in my use case? – Lin Ma Dec 20 '17 at 23:12
  • You save on hash calculation and get true O(1) iteration without dealing with collisions. But benchmark if you're not sure. – doron Dec 21 '17 at 16:27
  • @doron It depends on how fast the hash calculation actually is - but if the numbers of elements differ greatly, hash sets gain the advantage again (provided one iterates over the smaller one). We'd now have to know the use cases; tree sets might come with trade-offs at other points (random element access). Still good to point out an alternative... – Aconcagua Dec 28 '17 at 12:11
  • Note that step 3 can be tweaked a bit: instead of incrementing just once, look for the first element not smaller than the first of the other list. Phrased this way, it gets clearer that you can take advantage of random access (galloping strategy) or tree structure, to skip some elements. That does not affect the worst case, but it can matter in practice. – Marc Glisse Jan 17 '18 at 12:08
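
To make the skipping strategy from the last comment concrete, here is a minimal sketch on sorted std::vectors; using std::lower_bound as the "first element not smaller" step is one possible realization, not code from the thread:

#include <algorithm>
#include <vector>

std::vector<int> intersect_skipping(const std::vector<int>& a,
                                    const std::vector<int>& b) {
    std::vector<int> c;
    auto ia = a.begin();
    auto ib = b.begin();
    while (ia != a.end() && ib != b.end()) {
        if (*ia < *ib) {
            // Skip past everything in a below *ib with one binary search.
            ia = std::lower_bound(ia, a.end(), *ib);
        } else if (*ib < *ia) {
            ib = std::lower_bound(ib, b.end(), *ia);
        } else {
            c.push_back(*ia);
            ++ia;
            ++ib;
        }
    }
    return c;
}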
2

Thanks Angew, but why is your method faster? Could you elaborate a bit more?

Well, let me provide some additional info...

It should be pretty clear that, whichever data structures you use, you will have to iterate over all elements of at least one of them, so you cannot do better than O(n), n being the number of elements in the structure you iterate over. The key question, then, is how fast you can look up elements in the other structure. With a hash set, which std::unordered_set actually is, this is O(1), at least if the number of collisions is small enough ("reasonably evenly distributed hashes"); the degenerate case would be all values having the same hash...

So far, you get O(n) * O(1) = O(n). But you still have a choice: O(n) or O(m), with m being the number of elements in the other set. In complexity terms this is the same, it's a linear algorithm either way; in practice, though, you can save some hash calculations and lookups by iterating over the set with the smaller number of elements...

Aconcagua
  • Nice catch! Thanks. – Lin Ma Dec 20 '17 at 23:12
  • Note that this is a worst-case analysis. Depending on what the sets typically look like in this application, intersecting sorted sets may or may not be possible in sub-linear time (think of the extreme case where you intersect [0,100] with [200,300]; one comparison is enough to notice that the result is empty). – Marc Glisse Jan 17 '18 at 11:30
  • @MarcGlisse Ah, back at doron's idea... (my answer - actually addition to a previous one - being about hash sets...). But we'd have to modify doron's algorithm to profit from such cases (as a whole or for partial ranges), maybe some clever partition algorithm? – Aconcagua Jan 17 '18 at 12:03
  • For the ordered case, some references can be found at https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66418 . – Marc Glisse Jan 17 '18 at 12:11