Sort a vector of vectors and count frequencies of unique occurences using the standard library

Question

Given a std::vector<std::vector<int>>:

I want to output a sorted std::vector<std::vector<int>>
that contains only unique std::vector<int>
as well as the frequency (i.e., count) of these unique std::vector<int>

My question is twofold:

how can I efficiently achieve this relying only on the standard library?
what is causing the code below to perform poorly?

I tried the following:

std::vector<std::vector<int>> responseVectors 
{
    { 2,  3,  2 },
    { 3,  2,  3 }, 
    { 3,  2,  3 }, 
    { 3,  3,  3 }, 
    { 1,  2,  3 }, 
    { 1,  2,  3 }, 
    { 1,  2,  3 }, 
    { 2,  1,  2 }, 
    { 0,  1,  2 }, 
    { 2,  3,  2 },
    { 3,  2,  3 }, 
    { 3,  2,  3 }, 
    { 3,  3,  3 }, 
    { 1,  2,  3 }, 
    { 1,  2,  3 }, 
    { 1,  2,  3 }, 
    { 2,  1,  2 }, 
    { 0,  1,  2 }
};

std::vector<std::vector<int>> uniqueResponseVectors;
uniqueResponseVectors.reserve(responseVectors.size());

std::sort(responseVectors.begin(), responseVectors.end());
std::unique_copy(responseVectors.begin(), responseVectors.end(), std::back_inserter(uniqueResponseVectors));

std::vector<int> freqTotal;
freqTotal.reserve(uniqueResponseVectors.size());

for(auto&& vector : uniqueResponseVectors)
{
    int count = std::count(responseVectors.begin(), responseVectors.end(), vector);
    freqTotal.push_back(count);
}

And the output should be:

for(auto&& vector : uniqueResponseVectors)
{
    for(auto&& value : vector)
    {
        std::cout << "\t" << value << "\t";
    }

    std::cout << "\n";
}

// Printed result for the `uniqueResponseVectors`:
//
//    0               1               2
//    1               2               3
//    2               1               2
//    2               3               2
//    3               2               3
//    3               3               3

// Similarly for the `freqTotal`:
//
//    2
//    6
//    2
//    2
//    4
//    2

Additional context:

when I tried the code above against a larger dataset (i.e., responseVectors with a size of 100000 and 12 as the size of each std::vector<int>), it was slow.
I also tried to manipulate responseVectors directly by calling std::unique and erase() on it, in order avoid creating a copy. Then I used an iterator loop to std::count how many times a unique std::vector<int> occurs. However, this proved even slower.

if you use a `map` instead of vector you get what you want for free. Closely related: https://stackoverflow.com/questions/54476881/how-to-apply-the-concept-of-counting-occurrences-on-strings-variables-in-c (question is slightly different but the answer is a perfect match) — 463035818_is_not_an_ai, Feb 01 '19 at 12:20
what makes your solution slow compared to using a map is that you are traversing the input several times, when you use a map you just insert the elements — 463035818_is_not_an_ai, Feb 01 '19 at 12:26
@user463035818 Can you please elaborate in an answer? I appreciate the input, but not sure how to switch to using a `map`. — Mihai, Feb 01 '19 at 12:30
did you look at the link above? I could not write a better answer than given there — 463035818_is_not_an_ai, Feb 01 '19 at 12:39
Sorry, I completely missed the link. I am checking the answer (i.e., https://stackoverflow.com/a/54477263/5252007) right now! — Mihai, Feb 01 '19 at 12:41
actually I would flag your question as a duplicate. I didnt do it, because then your question would get closed immediately. If you read the answer and it did solve your problem, I think you could also flag your own question as a duplicate of the other one — 463035818_is_not_an_ai, Feb 01 '19 at 12:44
I am trying that implementation right now. If it solves my problem, I will write an answer below explaining why, provide a link to @YSC solution and flag it. Thanks! — Mihai, Feb 01 '19 at 12:48
@Mihai It is good you're using STL, but your question has far more to do with using the proper data structure(s). A map, hash table, etc. should have been the choice, and given that, you have `std::map`, `std::unordered_map`, etc. — PaulMcKenzie, Feb 01 '19 at 13:00
Wow! What I just learned is fantastic; made my day! Thanks for your suggestions @user463035818 and @PaulMcKenzie! — Mihai, Feb 01 '19 at 13:31
If you want to have unique elements and keep the same spirit of the vector, you can use `std::set`. — doudouremi, Feb 01 '19 at 14:04
Possible duplicate of [how to apply the concept of counting occurrences on strings variables in c++](https://stackoverflow.com/questions/54476881/how-to-apply-the-concept-of-counting-occurrences-on-strings-variables-in-c) — Mihai, Feb 01 '19 at 14:25
always glad to help. `std::map` already made my day several times ;). Just beware that performance wise there are usage patterns where `std::vector` still beats `std::map` (just consider the different complexity of their methods to insert, sort, etc.) — 463035818_is_not_an_ai, Feb 01 '19 at 14:45

Mihai · Accepted Answer · 2019-02-01T14:31:21.283

user463035818 pointed me to an excellent answer by YSC
and PaulMcKenzie further iterated on the importance of picking the proper data structure(s)

Based on this input, I tried to particularize the solution provided by YSC to my case, in order to better understand what is going on. It boils down to using std::map which is a sorted associative container:

std::map<std::vector<int>, int> SortAndCountOccurences(std::vector<std::vector<int>>& vectors)
{
    std::map<std::vector<int>, int> result;

    std::for_each(vectors.begin(), vectors.end(), [&result](auto const& vector) {
            ++result[vector]; 
    });

    return result;
}

With the following usage:

std::vector<std::vector<int>> responseVectors 
{
    { 2,  3,  2 },
    { 3,  2,  3 }, 
    { 3,  2,  3 }, 
    { 3,  3,  3 }, 
    { 1,  2,  3 }, 
    { 1,  2,  3 }, 
    { 1,  2,  3 }, 
    { 2,  1,  2 }, 
    { 0,  1,  2 }, 
    { 2,  3,  2 },
    { 3,  2,  3 }, 
    { 3,  2,  3 }, 
    { 3,  3,  3 }, 
    { 1,  2,  3 }, 
    { 1,  2,  3 }, 
    { 1,  2,  3 }, 
    { 2,  1,  2 }, 
    { 0,  1,  2 }
};

std::map<std::vector<int>, int> sortedVectors = SortAndCountOccurences(responseVectors);

Which will output:

for(auto&& vector : sortedVectors)
{
    for(auto&& value : vector.first)
    {
        std::cout << "\t" << value << "\t";
    }

    std::cout << "-> x" << vector.second << " times" << "\n";
}

//    0               1               2       -> x2 times
//    1               2               3       -> x6 times
//    2               1               2       -> x2 times
//    2               3               2       -> x2 times
//    3               2               3       -> x4 times
//    3               3               3       -> x2 times

Note: The solution by YSC can be generalized to anything iterable.

Thanks for the tip! What would be the advantage of a range-for? — Mihai, Feb 01 '19 at 16:32
Range-for is simpler than `std::for_each()` with lambda, it doesn't rely so much on the compiler to inline everything, and it allows you to use `break`, `return` and other normal flow-control. In addition, it's part of the core language instead of the standard library. — Deduplicator, Feb 01 '19 at 16:58
`std::for_each` is great whet you have a range (couple of begin/end iterators). range-for is great whet you have a collection (container, array, ...). This is a style issue, nothing serious here. — YSC, Feb 01 '19 at 16:58
@YSC When the Range TS is finally available, one can trivially adapt any iterator-pait for use with range-for, making `std::for_each()` pretty obsolete. — Deduplicator, Feb 01 '19 at 18:20

Sort a vector of vectors and count frequencies of unique occurences using the standard library

1 Answers1

Linked