How to optimize a std::set intersection algorithm (C++)

Question

I'm struggling with a part of my college assignment. I have two subsets of std::set containers containing pointers to quite complex objects, but ordered by different criteria (which is why I can't use std::set_intersection()). I need to find elements that are contained in both subsets as fast as possible. There is a time/complexity requirement on the assignment.

I can do that in n*log(m) time where n is the size of the first subset and m is the size of the second subset by doing the following:

for(auto it = subset1.begin(), it != subset1.end(), it++){
    if(find(subset2.begin(), subset2.end(), *it))
        result.insert(*it);
}

This fails the time requirement, which says worst case linear, but average better than linear.

I found the following question here and I find the hashtable approach interesting. However, I fear that the creation of the hashtable might incur too much overhead. The class contained in the sets looks something like this:

class containedInSets {
   //methods
private: 
    vector<string> member1;
    SomeObject member2;
    int member3;
}

I have no control over the SomeObject class, and therefore cannot specify a hash function for it. I'd have to hash the pointer. Furthermore, the vector may grow quite (in the thousands of entries).

What is the quickest way of doing this?

I didn't say it is, I responded to your assertion that creating a hash table would be too much overhead. Or were you not talking about creating your own from scratch? — StoryTeller - Unslander Monica, Apr 22 '18 at 09:10
How it can be better than linear on average? You need to look at every element, isn't it? Or your comparison functions for two sets somehow connected and you can exploit it? — Yola, Apr 22 '18 at 09:14
@StoryTeller - Creating an unordered_set requires me to specify a hash template/function for it, which is why I am afraid of the overhead — MChrudina, Apr 22 '18 at 09:16
You need an order relation for put elements in a `std::set`. Doesn't that classify as overhead? Why is that not a concern? A hasher would be the same type of overhead. — StoryTeller - Unslander Monica, Apr 22 '18 at 09:18
@Yola It's for the purpose of result filtering. The subsets are results of single filters, but I need to filter by both. The requirement is for the result filtering in its entirety (from the base sets), not the subsets. The problem is that when the first filter is not set, it takes the entire base set and fails. — MChrudina, Apr 22 '18 at 09:18
@StoryTeller - I've been given the sets and can't do anything about them. — MChrudina, Apr 22 '18 at 09:19
If your assignment talks purely about big-O times, then you shouldn't worry about overhead. — Hong Ooi, Apr 22 '18 at 09:24
@HongOoi - It also takes the running time of the entire algorithm, not just one specific part of it. Which is why I include the overhead. — MChrudina, Apr 22 '18 at 09:27
Is that runtime measured asymptotically? Or are you required to report actual (mili)seconds elapsed? — Hong Ooi, Apr 22 '18 at 09:29
The question seems close to unanswerable with this "average *sublinear*" requirement, and then the switch to timings in the comments. Can you please specify the *actual* requirements in the question? — user4815162342, Apr 22 '18 at 11:22

Jarod42 · Answer 1 · 2018-04-22T11:23:09.617

3

Your code is not O(n log(m)) but O(n * m).

std::find(subset2.begin(), subset2.end(), *it) is linear, but std::set has methods find and count which are in O(log(n)) (they do a binary search).

So you can simply do:

for (const auto& e : subset1) {
    if (subset2.count(e) != 0) {
        result.insert(e);
    }
}

Which has complexity of n*log(m) instead of your n * m.

edited Apr 22 '18 at 11:23

answered Apr 22 '18 at 11:05

Jarod42

203,559
14
181
302

@user4815162342: OP's code is wrong, OP's code is `O(n * m)` even if OP claims it is `O(n log(m))` – Jarod42 Apr 22 '18 at 11:20

How to optimize a std::set intersection algorithm (C++)

1 Answers1