10

I'd like to improve the performance of the box counting method I use in fractal analysis.

About the task

I have a stream of ints (about n=2^24 values long) and I have to calculate how many different values it contains. There's no upper bound and negative values are allowed (but the number of negative values is probably less than sqrt(n)). There's a small correlation in the stream, i.e. the current element is likely to be equal to, or not far from, the previous one. In many cases I have a lot of equal values across the whole range.

Methods I've already tried

vector, sort, unique

My first implementation was to put all the elements into a vector, then apply std::sort followed by std::unique.
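In outline, a minimal sketch of that baseline (the vector is assumed to be already filled from the stream; the function name is just for illustration):

#include <algorithm>
#include <cstddef>
#include <vector>

// Baseline: sort a copy of the data, then count the runs of equal values.
std::size_t count_distinct_sort(std::vector<int> values) {   // taken by value on purpose
    std::sort(values.begin(), values.end());
    auto last = std::unique(values.begin(), values.end());
    return static_cast<std::size_t>(last - values.begin());
}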

The complexity of this method is O(n*log(n)) and I don't think any other algorithm can scale better in general. But I'm sure code must exist that is faster than this while having the same scaling properties - so faster only by a constant factor. The reasons are:

  1. I have a lot of equal values stored in the vector, so sorting is not as effective as it could be and the vector is unduly large
  2. This method doesn't use the information that the current element and the previous one are close to each other
  3. I don't need to know what the unique values are, I only need the number of different elements

set, insert, size

To eliminate the first inefficiency, I put every element into a set with set::insert. At the end I counted the number of elements with set::size.
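Schematically (the container holding the stream values is a placeholder):

#include <cstddef>
#include <set>
#include <vector>

// The set rejects duplicates by itself; the answer is simply its final size.
std::size_t count_distinct_set(const std::vector<int>& stream) {
    std::set<int> seen;
    for (int v : stream)
        seen.insert(v);
    return seen.size();
}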

My expectation was that this code must be faster because only unique values are stored in the set, so new elements don't have to be compared against a lot of equal values. But unfortunately this method was 1.5 times slower than the previous one.

set, emplace_hint, size

To eliminate the second inefficiency as well, I put every element into a set, but with the function set::emplace_hint, each time giving a hint to place the new element next to the previous one. At the end, I asked for the size of the set with set::size.
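A sketch of this variant; the only change from the plain insert version is carrying the returned iterator over as the hint for the next insertion:

#include <cstddef>
#include <set>
#include <vector>

// emplace_hint returns an iterator to the (new or already existing) element,
// which serves as the hint for the next, hopefully nearby, value.
std::size_t count_distinct_set_hint(const std::vector<int>& stream) {
    std::set<int> seen;
    auto hint = seen.begin();
    for (int v : stream)
        hint = seen.emplace_hint(hint, v);
    return seen.size();
}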

My expectation was that this code must be faster than the previous one because I can guess the value of the new element, and a hint is better than nothing. But unfortunately this method was 5 times slower than the previous one.

The question

Can you suggest any efficient method that calculates the number of different elements (ints) in a stream? Can you optimize the code given that

  1. there's a measurable correlation in the numbers
  2. some numbers appear repeatedly

The target architecture is a modern x86 or x86-64 PC processor (with SSE, SSE2) and only single-threaded code is suitable. I prefer not to use Boost, but C++11 is fine.

The solution

First, thanks for the many suggestions, patience and understanding. I'm sorry that I cannot test all of the methods, and I'm also sure that effectiveness depends on details of the stream of ints that I haven't provided. However, I'm sharing the results I got with the VS2013 compiler. (Code was tested under gcc 4.7 but not measured.) This topic is worth a lot more time to investigate, but I've got a solution that fits my needs.
Time statistics for the different methods:

About the methods:

  • vector of bool: the BitVector solution from Dieter Lücking
  • binary lookup: the method suggested by Tony D
  • unordered set: simply putting all elements into an std::unordered_set, then asking for the number of its elements, as suggested by Ixanezis
  • vector insert sorted: using Dieter Lücking's Sorted Vector approach
  • set insert: the method I described in the question itself
  • radix sort: Ixanezis's suggestion, using a popular sorting algorithm on a vector
  • set emplace hint: using std::emplace_hint as described in the question itself
DanielTuzes
  • is "boy counting" box counting? – Marco A. Apr 18 '14 at 09:10
  • Typo, corrected, thanks. – DanielTuzes Apr 18 '14 at 09:13
  • 3
    any upper bound for the `int`s? or it can be anything up to `INT_MAX`? negative number allowed? – Sakthi Kumar Apr 18 '14 at 09:14
  • What about bucket or radix sort? – user541686 Apr 18 '14 at 09:19
  • The `set` is likely inefficient because of memory issues. A set is a node container with 3 pointers and a tag of overhead per node in most implementations. On a 64-bits platform this easily adds up to 24 bytes of overhead to the 4 bytes of your `int`... and of course the nodes are all over the place. – Matthieu M. Apr 18 '14 at 09:23
  • You could also try https://code.google.com/p/cpp-btree/ – user541686 Apr 18 '14 at 09:24
  • @SakthiKumar The cost to get the maximum and minimum number is O(n). Negative numbers are allowed but it's very rare (possibility is less than 1/n). The task is more detailed now, thanks! – DanielTuzes Apr 18 '14 at 10:15
  • We are lacking the proportion of unique items in the stream. The situation of being close to N unique items is very different from just U (where U << N) items. – Matthieu M. Apr 18 '14 at 13:31
  • As the stream comes from a fractal, the ratio of unique elements / total number of n elements follow a n^x law with x in range (0,-1]. If I could tell you the right ratio, I wouldn't need to use this algorithm. I'm looking for general suggestions that can be useful to anyone else, not only for me with a special ratio. – DanielTuzes Apr 18 '14 at 15:02
  • Have you tried `unordered_set` in place of `set`? – Adrian McCarthy Apr 19 '14 at 14:07
  • Yes, it was faster than sorting vector, but vector (aka BitVector by Dieter Lücking) was even faster. I'll write the results into the question as soon as I tested most of the codes. – DanielTuzes Apr 19 '14 at 21:00

6 Answers

5

Since you only deal with a limited range of integer numbers, the radix sort algorithm can be employed effectively here, removing the log(N) part of the complexity. You can pick any really fast implementation from the internet; some require SSE support, others are multithreaded or even written to run on a GPU.

If you can use boost::unordered_set or C++11 std::unordered_set, your second approach can easily be modified to use it, also resulting in a linear-complexity algorithm. However, if you have at least a few million numbers in the stream, I believe the first method would be faster.
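For illustration, here is a minimal scalar LSD (byte-wise) radix sort used for counting; it is only a sketch, not one of the tuned implementations mentioned above. Note that for counting purposes the sort merely has to group equal bit patterns together, so negative numbers need no special handling:

#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Four passes of 8 bits over the raw 32-bit patterns, then count the runs.
std::size_t count_distinct_radix(std::vector<int> data) {
    std::vector<int> buffer(data.size());
    for (int shift = 0; shift < 32; shift += 8) {
        std::array<std::size_t, 256> count{};                 // histogram of the current byte
        for (int v : data)
            ++count[(static_cast<std::uint32_t>(v) >> shift) & 0xFF];
        std::size_t offset = 0;                               // turn counts into start offsets
        for (std::size_t& c : count) { std::size_t tmp = c; c = offset; offset += tmp; }
        for (int v : data)                                    // stable scatter into the buffer
            buffer[count[(static_cast<std::uint32_t>(v) >> shift) & 0xFF]++] = v;
        data.swap(buffer);
    }
    std::size_t distinct = data.empty() ? 0 : 1;
    for (std::size_t i = 1; i < data.size(); ++i)
        if (data[i] != data[i - 1]) ++distinct;
    return distinct;
}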

Ixanezis
  • A good candidate for such a huge dataset will be binary quick sort (MSD Radixsort). The main advantage being that you can utilize 2^bits threads, without any interaction or synchronization. The idea is that you sort your elements to either end (like in quick sort), but based upon a given bit in the data. – Skeen Apr 18 '14 at 09:27
  • Well, if we want to employ parallelism in this task, I might say that radix sort is known to be heavily optimized for that kind of tasks and even being run on GPU with huge performance. – Ixanezis Apr 18 '14 at 09:38
  • Thanks for the good hint on radix sort, I'll check std::unordered_set soon. – DanielTuzes Apr 18 '14 at 10:58
  • @Ixanezis: Beware of `boost::unordered_set` and `std::unordered_set`, they have poor locality and probably between 16 bytes and 24 bytes of overhead *per element* (not counting the overhead of `malloc` for the nodes). There is a reason of course (not invalidating existing iterators when inserting) but since here it's superfluous, the cost is high. – Matthieu M. Apr 18 '14 at 13:43
  • I tried [msd radix sort](http://rosettacode.org/wiki/Sorting_algorithms/Radix_sort#C.2B.2B) but it was slower than vector sort (the default in my vs13 compiler is merge sort I think). unordered_set was faster than sorting vector. – DanielTuzes Apr 20 '14 at 07:26
  • Can you provide some kind of a data generator your have? I want to benchmark it myself. – Ixanezis Apr 20 '14 at 09:22
  • It may be too complicated but I can exposure the data. It is very kind of you to help me. – DanielTuzes Apr 20 '14 at 19:05
  • The data is also fine. – Ixanezis Apr 21 '14 at 06:13
4

Just comparing different approaches (Not considering radix sort):

#include <algorithm>
#include <cstdlib>
#include <deque>
#include <iostream>
#include <unordered_set>
#include <set>
#include <vector>
#include <chrono>

template <template <typename ...> class Container, typename T, typename ... A, typename Comp>
inline bool insert_sorted(Container<T, A...>& container, T const& e, Comp const& comp) {
    auto const it = std::lower_bound(container.begin(), container.end(), e, comp);
    if (it != container.end() and not comp(e, *it)) { return false; }
    container.insert(it, e);
    return true;
}

template <template <typename ...> class Container, typename T, typename ... A>
inline bool insert_sorted(Container<T, A...>& container, T const& e) {
    return insert_sorted(container, e, std::less<T>{});
}

int main() {
    using namespace std::chrono;
    typedef std::vector<int> data_type;

    const unsigned Size = unsigned(1) << 24;
    const unsigned Limit = 1000;
    data_type data;
    data.reserve(Size);
    for(unsigned i = 0; i < Size; ++i) {
        int value = double(Limit) * std::rand() / RAND_MAX - 0.1;
        data.push_back(value);
        while(i < Size - 1 && rand() < RAND_MAX * 0.25) {
            data.push_back(value);
            ++i;
        }
    }

    std::cout
        << "Data\n"
        << "====\n"
        << "                Size of data: " << Size << '\n';

    std::cout
        << "Unordered Set\n"
        << "=============\n";
    {
        auto start = system_clock::now();

        typedef std::unordered_set<int> set_type;
        set_type set;
        unsigned i = 0;
        for( ; i < Size - 1; ++i) {
            // Ignore a range of equal values
            while(data[i] == data[i+1]) ++i;
            set.insert(data[i]);
        }
        if(i < Size)
            set.insert(data[i]);

        auto stop = system_clock::now();

        std::cout
            << "Number of different elements: "
            << set.size() << '\n';
        std::cout
            << "                      Timing: "
            << duration_cast<duration<double>>(stop - start).count()
            << '\n';
    }

    std::cout
        << "Set\n"
        << "===\n";
    {
        auto start = system_clock::now();

        typedef std::set<int> set_type;
        set_type set;
        unsigned i = 0;
        for( ; i < Size - 1; ++i) {
            // Ignore a range of equal values
            while(data[i] == data[i+1]) ++i;
            set.insert(data[i]);
        }
        if(i < Size)
            set.insert(data[i]);

        auto stop = system_clock::now();

        std::cout
            << "Number of different elements: "
            << set.size() << '\n';
        std::cout
            << "                      Timing: "
            << duration_cast<duration<double>>(stop - start).count()
            << '\n';
    }

    std::cout
        << "Sorted Vector\n"
        << "=============\n";
    {
        auto start = system_clock::now();

        typedef std::vector<int> set_type;
        set_type set;
        unsigned i = 0;
        for( ; i < Size - 1; ++i) {
            // Ignore a range of equal values
            while(data[i] == data[i+1]) ++i;
            insert_sorted(set, data[i]);
        }
        if(i < Size)
            insert_sorted(set, data[i]);

        auto stop = system_clock::now();

        std::cout
            << "Number of different elements: "
            << set.size() << '\n';
        std::cout
            << "                      Timing: "
            << duration_cast<duration<double>>(stop - start).count()
            << '\n';
    }

    std::cout
        << "BitVector\n"
        << "=========\n";
    {
        auto start = system_clock::now();

        typedef std::vector<bool> set_type;
        set_type set(Limit);
        unsigned i = 0;
        unsigned elements = 0;
        for( ; i < Size; ++i) {
            if( ! set[data[i]]) {
                set[data[i]] = true;
                ++elements;
            }
        }

        auto stop = system_clock::now();

        std::cout
            << "Number of different elements: "
            << elements << '\n';
        std::cout
            << "                      Timing: "
            << duration_cast<duration<double>>(stop - start).count()
            << '\n';
    }

    std::cout
        << "Sorted Data\n"
        << "===========\n";
    {
        auto start = system_clock::now();

        std::sort(data.begin(), data.end());
        auto last = std::unique(data.begin(), data.end());

        auto stop = system_clock::now();

        std::cout
            << "Number of different elements: "
            << last - data.begin() << '\n';
        std::cout
            << "                      Timing: "
            << duration_cast<duration<double>>(stop - start).count()
            << '\n';
    }

    return 0;
}

Compiled with g++ -std=c++11 -O3 gives:

Data
====
                Size of data: 16777216
Unordered Set
=============
Number of different elements: 1000
                      Timing: 0.269752
Set
===
Number of different elements: 1000
                      Timing: 1.23478
Sorted Vector
=============
Number of different elements: 1000
                      Timing: 1.13783
BitVector
=========
Number of different elements: 1000
                      Timing: 0.038408
Sorted Data
===========
Number of different elements: 1000
                      Timing: 1.32827

Hence if memory is no issue or the range of numbers is limited setting a bit is the best choice. Otherwise, unordered_set is a good one.

  • For me, it's time to try unordered set and bitvector. Your stream seems a bit artificial, but can approximate my case. Thanks! – DanielTuzes Apr 18 '14 at 11:52
  • There is an alternative to `bitset` => don't test and count at the end. Also, a bitset covering all of int range... is 512MB large. There is no helping it if there are many values, but that's really large. – Matthieu M. Apr 18 '14 at 13:48
  • I tried the bitset approach with an effective range of 32 bits: unfortunately [execution expired](http://coliru.stacked-crooked.com/a/457ff8e74e715168). Too much memory I fathom :x – Matthieu M. Apr 18 '14 at 14:08
  • @MatthieuM. your code posted in your link works on my system and gives: `Final: URES = 5403, UNIQ = 5310, SORT = 5640, BITSET = 1.30201e+06`(whatever that means) –  Apr 18 '14 at 15:39
  • @DieterLücking: the duration is in micro-seconds (I should have put the unit), so `URES` takes 5.4 ms and `BITSET` 1.3 s (260x slower). It's possibly due to a slow allocation and/or cache misses, the L3 cache is only 8MB (compared to the 512MB of the container). – Matthieu M. Apr 18 '14 at 15:45
  • Why are you using a std::deque in BitSetUniquer (make it a std::vector) ? –  Apr 18 '14 at 15:55
  • Now it seems to me that vector is the fastest of all. – DanielTuzes Apr 20 '14 at 19:35
3

Assuming 32-bit ints, the worst-case scenario is that you need 2^32 bits to track the seen/not-seen status of every number you might see. That's 4 billion bits, or 512 million bytes - 512 megabytes - not prohibitive for a modern desktop computer. You basically index to byte [n/8] in the array, then bitwise AND or OR with 1 << (n % 8) to test or set that number's seen state. Because you say numbers that are close together in the input tend to be close in value, cache utilisation should be pretty good. You could also check for numbers just seen and bypass the bit-array processing.

If you happen to know that you have fewer than 2^32 distinct numbers to track in the input, you should of course reduce the size of the bit set accordingly. (Just read your comment "Negative numbers are allowed but it's very rare (possibility is less than 1/n)." - in that case, you could use a set for the negative numbers and use half the memory for the positives.)

(If you're worried about a final iteration over many memory pages that may have no bits set at all, you could create an additional "dirty page" index with a bit per page to guide such iteration, but given the quantity of input, if that input's spread wildly across int's numeric range that may be insignificant or even counter-productive.)

EDIT - further explanation as requested in the comments. First, an implementation:

#include <algorithm>   // std::fill_n
#include <cstddef>     // size_t
#include <cstdint>     // uint32_t

template <size_t N>
class Tracker
{
  public:
    Tracker() { std::fill_n(&data_[0], words, 0); }
    // n must be non-negative and less than N
    void set(int n) { data_[n / 32] |= (1u << (n % 32)); }
    bool test(int n) const { return (data_[n / 32] & (1u << (n % 32))) != 0; }

    template <typename Visitor>
    void visit(Visitor visitor)   // taken by value so a lambda temporary can be passed
    {
        for (size_t word = 0, word_n = 0; word < words; ++word, word_n += 32)
             if (data_[word])
                  for (uint32_t n = 0, value = 1; n < 32; ++n, value *= 2)
                      if (data_[word] & value)
                          visitor(word_n + n);
    }
  private:
    static const size_t words = N / 32 + (N % 32 ? 1 : 0);
    uint32_t data_[words];
};

Usage:

// NOTE: covering the full non-negative int range makes this object ~256 MB,
// so a local variable like this can overflow the stack - consider new-ing it.
Tracker<size_t(std::numeric_limits<int>::max()) + 1> my_tracker;
int n;
while (std::cin >> n)
    my_tracker.set(n);
my_tracker.visit([](unsigned n) { std::cout << n << '\n'; });

(not tested... probably a few little issues)

Can you explain your answer more detailed please?

All this does is create what's conceptually just a bool have_seen[] array that can be directly indexed by any integer you're interested in: you just go through the input setting the boolean elements at indices you see in the input to true. If you set something to true two or more times - who cares? Purely to save memory and get speed for searching for set bits (and e.g. fill/clear), it's manually packing the bool values into bits in a larger integral data type.

I think I can bother with negative values because I can calculate the largest and smallest values for a total cost of O(n).

Well, maybe, but it might be faster or slower to do two passes. With the approach I've documented, you don't need to go over the data twice... you can be preparing the answer during the first iteration. Of course, if doing an initial iteration is fast (e.g. from SSD media), and you're tight enough on memory that you want to do the actual analysis for only the actual data range, then go for it.

It also facilitates to shrink the width of the int into the correct value, so more than half of the pages will be non-empty.

Not sure what you mean there.

Tony Delroy
  • Nice approach, and there's [std::bitset](http://en.cppreference.com/w/cpp/utility/bitset) to implement this easily. However, if you need to solve the task multiple times, you have to keep track of the bits set to `1` last time to be able to 'clean' them before the next launch. Simple `memset` would not probably be a nice idea, if datasets contain only about `2^24` elements. And +1 for actual application of the corellation input. – Ixanezis Apr 18 '14 at 10:18
  • @Ixanezis thanks :-). I nearly mentioned `bitset`, but AFAIK it lacks ability to report the set bits efficiently at the end of processing (calling `.test(n)` seems nasty, especially if there are long runs of 0s). For a fast `clear()`, the dirty page idea may be worthwhile - allows targeted `memset`s - if there're any bits on the page the cache faulting may be the bottleneck anyway. But, proof's always in the benchmarking... until the system or data changes. :-) – Tony Delroy Apr 18 '14 at 10:47
  • Can you explain your answer more detailed please? I think I can bother with negative values because I can calculate the largest and smallest values for a total cost of O(n). It also facilitates to shrink the width of the `int` into the correct value, so more than half of the pages will be non-empty. – DanielTuzes Apr 18 '14 at 11:30
  • I attempted it [here](http://coliru.stacked-crooked.com/a/457ff8e74e715168) with a `std::deque` as a basis; I might have done an error... if I did not though the method seems impractical (it times out whereas the other 3 methods presented on the page all have time to execute). I am afraid that beyond the amount of memory (which is indeed trivial on any computer), this is killed by cache misses. – Matthieu M. Apr 18 '14 at 14:16
  • @MatthieuM. I've only looked for 60 seconds, but aren't you sparing random numbers at it? That's the worst possible for caching, and very different from Daniel's stated input's characteristic of grouping and repeated inputs. So, at this stage I'm not going to investigate whether you have other implementation issues. Cheers. – Tony Delroy Apr 18 '14 at 16:31
  • @TonyD: I've read the OP description of the issue, and unfortunately it's quite imprecise. For example there is no ratio: the optimal solution for 100 unique numbers or for 2**16 unique numbers are probably very different! Although, he did not gave the chances of two consecutive values being close to each other, so once again a dead-end. Without further characterization of the issue, a random sample is as precise as I can get :x – Matthieu M. Apr 18 '14 at 16:37
  • @TonyD: Now I understand your suggestion and I unsay "shrink the width of the int". But now I think this solution can have a similar performance as a "BitVector" (a vector) described by Dieter Lücking. I will test your solution soon too. – DanielTuzes Apr 19 '14 at 20:56
  • @DanielTuzes the setting of bits is effectively like a `std::vector` or `std::bitset<>`, but neither of those have a mechanism for quickly finding just the set bits afterwards (like the "visit" code above). Matthieu's benchmark make me think the "dirty page" idea I mention may be worth implementing too - basically that means having say a `bitset<>` with one bit per memory page (likely 4k), if that's not set during iteration all 4096 bits in the larger bitset can be ignored, avoiding faulting the page back into cache. Benefit/cost depends on the proportion of dirty pages. – Tony Delroy Apr 20 '14 at 02:20
  • What about using `n & 7` instead of `n % 8`? It can be faster and gives the same value for n>=0. And similarly, `value << 1` instead of `value *= 2`? – DanielTuzes Apr 20 '14 at 07:37
  • @DanielTuzes you compiler's optimiser knows that too ;-). Use whatever's clearer to you, and the optimiser should use what's faster. – Tony Delroy Apr 20 '14 at 14:36
  • My compiler says `size_t(std::numeric_limits::max()) + 1` isn't compile-time constant expression. So I substituted it with 2147483648. The function `bool test(int n) const` is not used in the code, is it intended? Is `&=` a typo insted of `&` in this function? Why the compilacted `void visit(Visitor& visitor)` template? I mean it is fancy, but a [simple counter](http://pastebin.com/HGYrsBRC) would be enough. Anyhow, unfortunatelly the program gives Stack overflow and then Acces violation when it reaches the function where `Tracker<2147483648> my_tracker;` is used. – DanielTuzes Apr 20 '14 at 18:58
  • @DanielTuzes it may be easier to not have Tracker be a template, and work out the array size constant yourself, that may fix the stack overflow - if not, create a Tracker on the heap with new. test is just there to help make the class a bit more complete and self documenting. It should just be &. If you only want to count - fine - use code like Mattieu's - I thought you cared which values were seen. – Tony Delroy Apr 20 '14 at 20:20
  • [I corrected the code](http://pastebin.com/Pp6EX6Ex). So I removed the size template argument and stack overflow has gone. In `void set(int n)` the `% 8` can be a mistake and it gives bad result but `% 32` seems good. For small data, this method looks a bit slower than the fastest vector with in-time check (as written by Dieter Lücking) or post count. – DanielTuzes Apr 20 '14 at 22:20
  • As I'd thought you cared about the actual distinct values, checking as you set a bit didn't seem useful, but if you just want your count then you can modify the code in my answer to check the bit before setting it should indeed be faster. If you want to make this even faster, you could try using the [x86 BTS instruction](http://x86.renejeschke.de/html/file_module_x86_id_25.html) which calculates the bit value, sets the bit and returns the old value all in one. If page caching is the bottleneck it won't help - otherwise it may well. – Tony Delroy Apr 21 '14 at 04:22
1

The standard data structure for this task is a hash set, i.e. std::unordered_set in the standard library (btw, Google's dense_hash_set usually performs slightly better).

You do not need the unique values to be sorted, so std::set is unnecessarily slow for your use case.

Like others suggest, you can also use a bit vector if your universe (the range of possible values) is not too big. If you have negative values, you can just cast them to unsigned and treat them as really big numbers.
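A sketch of the cast trick combined with a full-range bit vector (roughly 512 MB of bits, assuming a 64-bit build; 'stream' is again a placeholder for the input):

#include <cstddef>
#include <cstdint>
#include <vector>

// The int's bit pattern is reinterpreted as an unsigned index, so negative
// values land in the upper half of the table instead of producing negative indices.
std::size_t count_distinct_bits(const std::vector<int>& stream) {
    std::vector<bool> seen(std::size_t(1) << 32);   // 2^32 bits ~ 512 MB
    std::size_t distinct = 0;
    for (int v : stream) {
        std::size_t idx = static_cast<std::uint32_t>(v);
        if (!seen[idx]) { seen[idx] = true; ++distinct; }
    }
    return distinct;
}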

b.buchhold
1

Would it help to just compare the current element to the previous one before passing it to the counting method, whatever that is?

Or keep a small, fast cache of, say, the last 10 elements to discard short-range duplicates?

Or do the counting in batches (count a sequence of 100 with temporary counters, then merge with the previous counts)?
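A rough sketch of the first two ideas as a pre-filter placed in front of whatever counting structure is used (class and member names are made up for illustration):

#include <array>
#include <cstddef>

// Drops values that occurred among the last K elements before they reach
// the (more expensive) counting container passed in as 'target'.
template <typename Set, std::size_t K = 10>
class RecentFilter {
public:
    explicit RecentFilter(Set& target) : pos_(0), filled_(0), target_(target) {}
    void push(int v) {
        for (std::size_t i = 0; i < filled_; ++i)
            if (recent_[i] == v) return;          // short-range duplicate, skip entirely
        recent_[pos_] = v;                        // remember it in the ring buffer
        pos_ = (pos_ + 1) % K;
        if (filled_ < K) ++filled_;
        target_.insert(v);                        // forward to the real counting method
    }
private:
    std::array<int, K> recent_;
    std::size_t pos_, filled_;
    Set& target_;
};

Used e.g. with std::set<int> s; RecentFilter<std::set<int>> f(s);, calling f.push(v) for every element and reading s.size() at the end.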

  • An adaptive code could be efficient: it would recalculate how far back it should check for duplicates and, in the end, recheck all possibly non-duplicate elements. But e.g., as I understand it, Tony D's solution cannot benefit from this because comparing is not more effective than setting the "seen" bit to true. – DanielTuzes Apr 18 '14 at 11:37
  • @DanielTuzes "Tony D's solution can not benefit from this because comparing is not more effective than set the "seen" bit to true." - actually I say "You could check for numbers just seen and bypass the bit-array processing." - the division and modulus operations can be relatively expensive, though for powers of two the optimiser may use bit shifts which should be pretty fast. – Tony Delroy Apr 18 '14 at 12:24
-1

I approve of trying STL containers (set, unordered_set, ...), but unfortunately you pay a penalty for them: their memory-stability and lightweight-iterator requirements force them to be implemented as node-based containers, with a huge (relatively speaking) overhead for each and every element.

I would propose three approaches instead:

  1. Stick to a vector (only efficient with a low proportion of unique items)
  2. Implement Robin Hood hashing
  3. Use a probabilistic approach

Sorted Vector

For the vector approach: nothing prevents you from keeping the vector sorted as you insert, and thus avoiding the insertion of duplicate elements. An example:

#include <iostream>

#include <algorithm>
#include <vector>

template <typename T, typename Comp>
void insert_sorted(std::vector<T>& vec, T const& e, Comp const& comp) {
    auto const it = std::lower_bound(vec.begin(), vec.end(), e, comp);

    if (it != vec.end() and not comp(e, *it)) { return; }

    vec.insert(it, e);
}

template <typename T>
void insert_sorted(std::vector<T>& vec, T const& e) {
    insert_sorted(vec, e, std::less<T>{});
}

int main() {
    int const array[] = { 4, 3, 6, 2, 3, 6, 8, 4 };

    std::vector<int> vec;
    for (int a: array) {
        insert_sorted(vec, a);
    }

    std::cout << vec.size() << ":";
    for (int a: vec) { std::cout << " " << a; }
    std::cout << "\n";

    return 0;
}

Displays: 5: 2 3 4 6 8.

The lookup is still O(log n) per element, and insertion in the middle of the vector is linear in its current size (see the note and comments below); the gain is that it requires less memory:

  • with less memory, more of the vector stays in cache
  • the previous element being close to its successor is exploited: the binary search of lower_bound walks through nearly the same cache lines

It should already be a great improvement.

Note: it has been pointed out that insertion in the middle of a vector is not efficient. Of course, since it involves shifting half the already-present elements (on average). Still, benchmarks suggest it can beat the sort-based vector solution when the number of unique elements is small (0.1% in my benchmark).


Robin Hood Hashing

More involved, but Robin Hood hashing has very good characteristics, and thus performance. Most notably, it's implemented on top of a single dynamic array (like vector), so it exhibits good memory locality.

Rust switched to Robin Hood hashing for the default implementation of its hash tables and is very happy with it.

Note: from quick benchmarks, even unordered_set beats the pants off the store & sort approach, and a naive open-addressed hash table is 25% faster still.
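To make the idea concrete, here is a rough sketch of an open-addressing set for ints with Robin Hood probing. It assumes a power-of-two capacity chosen up front (no rehashing), so the capacity has to exceed the expected number of unique values; the integer hash mixer is an arbitrary choice:

#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

class RobinHoodSet {
public:
    explicit RobinHoodSet(std::size_t capacity_pow2)
        : slots_(capacity_pow2), used_(capacity_pow2, false),
          mask_(capacity_pow2 - 1), size_(0) {}

    bool insert(int value) {                      // returns true if the value was new
        std::uint32_t key = static_cast<std::uint32_t>(value);
        std::size_t pos = hash(key) & mask_;
        std::size_t dist = 0;                     // probe distance of the element in hand
        for (;;) {
            if (!used_[pos]) { slots_[pos] = key; used_[pos] = true; ++size_; return true; }
            if (slots_[pos] == key) return false; // already counted
            std::size_t resident_dist =
                (pos + slots_.size() - (hash(slots_[pos]) & mask_)) & mask_;
            if (resident_dist < dist) {           // "steal from the rich": swap and carry on
                std::swap(key, slots_[pos]);
                dist = resident_dist;
            }
            pos = (pos + 1) & mask_;
            ++dist;
        }
    }
    std::size_t size() const { return size_; }

private:
    static std::uint32_t hash(std::uint32_t x) {  // cheap integer mixing
        x ^= x >> 16; x *= 0x45d9f3bu; x ^= x >> 16;
        return x;
    }
    std::vector<std::uint32_t> slots_;
    std::vector<bool> used_;
    std::size_t mask_;
    std::size_t size_;
};

One would size it generously (say RobinHoodSet set(std::size_t(1) << 25); for this problem), insert every element and read set.size() at the end.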


Probabilistic Approach

For very large problems, a very well known algorithm is HyperLogLog. It has recently been implemented in Redis.

It has a very good ratio of memory used vs. error rate, and it is relatively simple to implement (especially following Antirez' code).
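As an illustration only (not Antirez' code; the register count and mixing function here are arbitrary choices), a compact HyperLogLog might look like this; the relative error is roughly 1.04/sqrt(m) with m registers:

#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// 2^b registers, each storing the maximum "rank" (position of the first set
// bit of the hash remainder) observed in its share of the hash space.
class HyperLogLog {
public:
    explicit HyperLogLog(unsigned b = 14) : b_(b), m_(std::size_t(1) << b), reg_(m_, 0) {}

    void add(int value) {
        std::uint64_t h = mix(static_cast<std::uint32_t>(value));
        std::size_t idx = static_cast<std::size_t>(h >> (64 - b_)); // first b bits pick a register
        std::uint64_t w = h << b_;                                  // remaining bits
        std::uint8_t rank = 1;
        while (rank <= 64 - b_ && !(w & (std::uint64_t(1) << 63))) { w <<= 1; ++rank; }
        if (rank > reg_[idx]) reg_[idx] = rank;
    }

    double estimate() const {
        double alpha = 0.7213 / (1.0 + 1.079 / m_);   // bias constant, valid for m >= 128
        double sum = 0.0;
        std::size_t zeros = 0;
        for (std::uint8_t r : reg_) { sum += std::ldexp(1.0, -int(r)); if (r == 0) ++zeros; }
        double e = alpha * m_ * m_ / sum;
        if (e <= 2.5 * m_ && zeros != 0)              // small-range correction (linear counting)
            e = m_ * std::log(double(m_) / zeros);
        return e;
    }

private:
    static std::uint64_t mix(std::uint64_t x) {       // splitmix64-style finalizer
        x += 0x9e3779b97f4a7c15ULL;
        x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
        x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
        return x ^ (x >> 31);
    }
    unsigned b_;
    std::size_t m_;
    std::vector<std::uint8_t> reg_;
};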


Throwing more hardware at the problem

Note that this is an embarrassingly parallel issue, you could thus easily have multiple threads each:

  • picking a bundle of IDs from the stream (say: 2**10 at once)
  • merging them into a thread-local set of unique IDs (whatever its implementation)
  • looping until the stream is empty
  • and finally merging their results together

And you could get a speed-up close to the number of threads (there is some overhead, obviously).

Note: not too coincidentally, the two approaches can be easily adapted with this method, and both support efficient merges.

Matthieu M.
  • 3
    I hope you're joking. Inserting into the middle of a vector requires shifting elements, it's obviously having linear complexity, thus resulting in overall `O(n*n*log(n))` complexity. Unless you have very few different elements in the dataset, this algorithm seems to be the worst of all. Or am I missing something.. – Ixanezis Apr 18 '14 at 09:49
  • This is `O(n^2)`, not `O(n log n)`, because of the series of moves with each insertation. The algorithm you're describing is basically the [binary insertion sort](http://en.m.wikipedia.org/wiki/Insertion_sort#Variants). – Avidan Borisov Apr 18 '14 at 10:07
  • @Avidanborisov No, it has the `O(n^2 * log n)`, because of the additional `lower_bound`. – Ixanezis Apr 18 '14 at 10:10
  • No, it is `O(n(n + log n))` which is asymptotically equivalent to `O(n^2)` – Avidan Borisov Apr 18 '14 at 10:22
  • I can agree with Ixanezis that inserting elements into the middle of a vector is not efficient. But a probabilistic approach is still not acceptable. I have a fractal and it is quite hard to estimate the error and I cannot use the result if it has artifacts. The hardware is given, the problem is not on how to parallelize the code and bigger compute capacity has nothing to do with code efficiency. Anyhow, thanks for your reply, I appreciate it! – DanielTuzes Apr 18 '14 at 10:39
  • @Avidanborisov Yes, you're right, I'm mistaken. Cannot fix my first comment, unfortunately :( – Ixanezis Apr 18 '14 at 11:35
  • @Ixanezis: No, I am serious. We are dealing with *two* numbers here: N the number of items read and U the number of *unique* items read. The `vector` solution only evolves in terms of `U` not in terms of `N`, so it obviously favors the case where `U << N`. Still, even when `U` approaches `N` it might be more efficient than the `set` insertion because of memory efficiency and locality; let's remember than the move of `int` is lowered down to large `memcpy`, then are not moved one at a time. Thus, before shooting this solution, I would recommend a benchmark. – Matthieu M. Apr 18 '14 at 11:57
  • @DanielTuzes: I [benchmarked](http://coliru.stacked-crooked.com/a/4c93661c0cba4ebc) the code for you here. While not always more efficient, the insertion sort approach still beats the store & sort approach when U << N. I also remembered about Robin Hood hashing, which you want to check out; it's much more efficient that `unordered_set` (less memory, less pointer indirections). – Matthieu M. Apr 18 '14 at 13:27
  • I'm sure that the vector insertation is not the optimal algorithm in general, but it worth a try. I'll try that too. – DanielTuzes Apr 18 '14 at 23:13
  • @DanielTuzes: Oh, it is certainly not optimal in general; as noted by Ixanezis it degenerates quite badly actually. Even a simple `std::unordered_set` quickly run circles around it when `U` augments. If you wish to try something really more general & efficient, I believe Open-Addressing Hash Tables (such as those using Robin Hood hashing) are the best bet; unfortunately they are not standard and not even implemented in Boost as far as I know so finding a good implementation with an appropriate license is tough and rolling your own (debugged and optimized) is costly. – Matthieu M. Apr 19 '14 at 12:24
  • @DanielTuzes: here is [a quick & dirty Open-Addressing Hash Set](http://coliru.stacked-crooked.com/a/14a072d2a0bf54fb) which performs 25% better than `std::unordered_set`. – Matthieu M. Apr 19 '14 at 13:06