How to get all combinations by matching fields

Question

I have 4 classes. Jacket, Shirt, Tie and Outfit.

class Outfit {
//...
    shared_ptr<Jacket> jacket;
    shared_ptr<Shirt> shirt;
    shared_ptr<Tie> tie;
//...
}

class Jacket {
public:
    Jacket(const string &brand, const string &color, const string &size);
    // ... getters, setters etc. ...
private:
    string brand, color, size
}

// The same for shirt and tie, brand, followed by color or size

I need to get all the possible matches between jacket and shirts, jacket and ties respectively. Like this:

vector<shared_ptr<Jacket>> jackets { 
    make_shared<Jacket>("some_brand", "blue", "15"), 
    make_shared<Jacket>("some_other_brand", "red", "14") 
};

vector<shared_ptr<Shirt>> shirts {
    make_shared<Shirt>("brand1", "doesnotmatterformatchingcriteria", "15"),
    make_shared<Shirt>("brand6", "blue", "15"),
    make_shared<Shirt>("brand3", "blue", "14"),
    make_shared<Shirt>("brand5", "red", "15"),
    make_shared<Shirt>("brand6", "red", "14")
};

vector<shared_ptr<Tie>> ties {
    make_shared<Tie>("other_brand1", "blue"),
    make_shared<Tie>("other_brand2", "blue"),
    make_shared<Tie>("other_brand6", "blue"),
    make_shared<Tie>("other_brand7", "blue"),
};

void getAllPosibilities(vector<Outfit> &outfits) {
    for (const auto &jacket : jackets) {
        for (const auto &shirt : shirts) {
            if (jacket->getSizeAsString() == shirt->getSizeAsString()) {
                for (const auto &tie : ties) {
                    if (jacket->getColor() == tie->getColor()) {
                        outfits.push_back(Outfit(jacket, shirt, ties));
                    }
                }
            }
        }
    }
}

So basically I want all the combinations, regardless of the brand name, only by the fields i specify to match, but I think this is painfully slow, considering I keep nesting for loops. In my actual problem, I have even more fields to match, and more classes and i think it is not ideal at all.

Is there any better/simpler solution than doing this?

I'm actually trying to do this as a pre-fetch / look up table for other functions to use the data. Since i do not know when the for loops should end and I want all the possible results, I can't early-return. I should've mentioned it, but all the fields are std::strings. — yierstem, May 08 '21 at 14:30
Why can't you `continue` Y-loop when `objectX.getField1() != objectY.getField1()`? — 김선달, May 08 '21 at 14:34
I see, to break the nested Z loop to not check the same condition for every item in allObjectsZ, that's a good idea. thanks. — yierstem, May 08 '21 at 14:38
Idea: do profiling. Which function takes the most time? Is that surprising? What can you do to make it faster? Note: you should ask yourself the last question only after you answer the preceding ones. — anatolyg, May 15 '21 at 19:44
My understanding, what are you doing is the equivalent of a database join (inner), so just build some "indexes" (i.e. hashmaps or their equivalent in C++) on relevant attributes and do the maching working on indexes only. — Rocco, May 15 '21 at 21:49
Yes, i'm trying to get them all pre-fetched, so I can use them later. — yierstem, May 16 '21 at 05:54

He3lixxx · Accepted Answer · 2021-05-16T20:39:12.667

What you're doing here is commonly known in databases as a join. As an SQL query, your code would look like this:

select * from jacket, shirt, tie where jacket.size == shirt.size and jacket.color == tie.color;

Algorithmic Ideas

Nested Loop Join

Now, what you've implemented is known as a nested loop join, which usually would have complexity O(jackets * shirts * ties), however, you have an early return inside your shirt-loop, so you save some complexity there and reduce it to O(jackets * shirts + jackets * matching_shirts * ties).

For small data sets, as the one you provided, this nested loop join can be very effective and is usually chosen. However, if the data gets bigger, the algorithm can quickly become slow. Depending on how much additional memory you can spare and whether sorting the input sequences is okay with you, you might want to use approaches that utilize sorting, as @Deduplicator initially pointed out.

Sort Merge Join

The sort-merge-join usually is used for two input sequences, so that after both sequences have been sorted, you only need to traverse each sequence once, giving you complexity of O(N log N) for the sorting and O(N) for the actual join phase. Check out the Wikipedia article for a more in-depth explanation. However, when you have more than two input sequences, the logic can become hard to maintain, and you have to traverse one of the input sequences more than once. Effectively, we will have O(N log N) for the sorting of jackets, shirts and ties and O(jackets + shirts + colors * ties) complexity for the actual joining part.

Sort + Binary Search Join

In the answer @Deduplicator gave, they utilized sorting, but in a different way: Instead of sequentially going through the input sequences, they used binary search to quickly find the block of matching elements. This gives O(N log N) for the sorting and O(jackets * log shirts + log ties * matching jacket-shirt combinations + output_elements) for the actual join phase.

Hash Join

However, all of these approaches can be trumped if the elements you have can easily be hashed, as hashmaps allow us to store and find potential join partners incredibly fast. Here, the approach is to iterate over all but one input sequence once and store the elements in a hashmap. Then, we iterate over the last input sequence once, and, using the hash map, find all matching join partners. Again, Wikipedia has a more in-depth explanation. We can utilize std::unordered_map here. We should be able to find a good hashing function here, so this gives us a total runtime of O(jackets + shirts + ties + total_output_tuples), which is a lower bound on how fast you can be: You need to process all input sequences once, and you need to produce the output sequence.

Also, the code for it is rather beautiful (imho) and easy to understand:

void hashJoin(vector<Outfit> &outfits) {
    using shirtsize_t = decltype(Shirt::size);
    std::unordered_map<shirtsize_t, vector<shared_ptr<Shirt>>> shirts_by_sizes;

    using tiecolor_t = decltype(Tie::color);
    std::unordered_map<tiecolor_t, vector<shared_ptr<Tie>>> ties_by_colors;

    for(const auto& shirt : shirts) {
        shirts_by_sizes[shirt->size].push_back(shirt);
    }

    for(const auto& tie : ties) {
        ties_by_colors[tie->color].push_back(tie);
    }

    for (const auto& jacket : jackets) {
        for (const auto& matching_shirt : shirts_by_sizes[jacket->size]) {
            for (const auto& matching_tie : ties_by_colors[jacket->color]) {
                outfits.push_back(Outfit{jacket, matching_shirt, matching_tie});
            }
        }
    }
}

However, in the worst case, if our hashing does not give us uniform hashes, we might experience worse complexity, as the O(1) access can become worse. You would want to inspect the hash maps to spot this, and replace the hash function in this case.

Implementations

I've posted a working implementation of all four algorithms I discussed here on godbolt. However, since they are rather long, I only included the (superior) hash join algorithm in this answer

Lower bounds and output elements

As @Dominique pointed out, there is no way to have a better run time complexity than O(output_element_count), since you have to produce each output element at least once. So, if you expect that your result is (asymptotically) close to jackets * shirts * ties, the nested loop join is the variant with the lowest overhead, and thus should be fastest. However, I think this will not be the case here.

Deduplicator · Answer 2 · 2021-05-17T02:14:04.640

Sorting is often a good idea:

static constexpr auto size_comp = [](auto&& a, auto&& b){
    return a->getSizeAsString() < b->getSizeAsString(); };
static constexpr auto color_comp = [](auto&& a, auto&& b){
    return a->getColor() < b->getColor(); };

std::sort(jackets.begin(), jackets.end(), size_comp);
std::sort(shirt.begin(), shirt.end(), size_comp);
std::sort(tie.begin(), tie.end(), color_comp);

auto ps = shirts.begin();
for (auto pj = jackets.begin(); pj != jackets.end() && ps != shirts.end(); ++pj)
    ps = std::lower_bound(ps, shirts.end(), *pj, size_comp);
    auto pt = std::lower_bound(ties.begin(), ties.end(), *pj, color_comp);
    for (auto psx = ps; psx != shirts.end() && !size_comp(*pj, *psx); ++psx) {
        for (auto ptx = pt; ptx != ties.end() && !color_comp(*pj, *ptx); ++ptx)
            outfits.emplace_back(*pj, *psx, *ptx);
    }
}

score 0 · Answer 3 · edited May 15 '21 at 20:07

I don't think it's possible to optimise, in case you are interested in all cases, for this simple reason:

Imagine you have 2 brands of jackets, 5 of shirts and 4 of ties, then you are looking for 2*5*4, which means 40 possibilities. That's just the amount you're looking for, so no optimising there. Then the following loop is ok:

for all jackets_brands:
  for all shirts_brands:
    for all ties_brands:
      ...

However, imagine you have some criteria, like some of the 5 brands don't go together with some of the 4 brands of the ties. In that case, it might be better to alter the sequence of the for-loops, as you can see here:

for all shirts_brands:
  for all ties_brands:
    if go_together(this_particular_shirts_brand, this_particular_ties_brand)
    then for all jackets_brands:
           ...

In this way, you might avoid some unnecessary loops.

While it is true for the worst case that all combinations match and we will get the O(N*M*K) compexity, we usually don't expect it for the average case, so there is a lot of optimization potential there. [Databases go a long way to optimize these kinds of computations](https://en.wikipedia.org/wiki/Category:Join_algorithms), similar approaches could be taken here. — He3lixxx, May 16 '21 at 11:41