I'd like to improve the speed performance of my box counting method I use in fractal analysis.
About the task
I have a stream of ints (about n=2^24 long) and I have to calculate how many different values are in the stream. There's no upper boundary and negative values are allowed (but the amount of negative values is possibly less than sqrt(n)). There's a small correlation in the stream, i.e. the actual element is likely to be equal or not too far from the previous one. In many cases I have a lot of equal values in the whole range.
Methods I've already tried
vector, sort, uniqe
My first implementation was to put all the elements into a vector and then I applied a std::sort and then std::unique.
The complexity of this method is O(n*log(n)) and I don't think any other algorithm can be faster in general when it comes to scaling. But I'm sure a code must be exists that is faster than this but with the same scaling properties - so faster only with a constant factor. The reasons are:
- I have a lot of equal values stored in the vector so sorting is not as effective, the vector is unduly large
- In this method I don't use the information that the actual element and the previous are close to each other
- I don't need information on what are those unique values, I only need the number of different elements
set, insert, size
To eliminate the first ineffectiveness point, I put every elements into a set with set::insert. And in the end I counted the number of elements with set::size.
My expectation was that this code must be faster because only unique values are stored in the set and it don't have to compare new elements with a lot of equal values. But unfortunately this method was 1.5 times slower than the previous one.
set, emplace_hint, size
To eliminates the second ineffectiveness point too, I not only put every elements into a set, but with the function set::emplace_hint. And every time a gave a hint to put the new element next to the previous one. And in the end, I asked for the size of the set with the set::size
My expectation was that this code must faster the previous one because I can guess the value of the new element and it's better than nothing. But unfortunately this method was 5 times slower then the previous one.
The question
Can you suggest any effective method that can calculates the number of different elements (ints) in a stream? Can you optimize the code if it is known that
- there's a measurable correlation in the numbers
- some numbers are appear repetadly
The target architecture is modern x86 or x86-64 PC processor (with sse, sse2) and only single thread code is suitable. I prefer not to use boost but c++11.
The solution
First, thanks for the many suggestions, patience and understanding and I'm sorry that I cannot test all of the methods and I'm also sure that effectiveness is depends on the details of the stream of ints what I've not provided. However I share the results I've got with VS2013 compiler. (Code is tested under gcc4.7 but not measured.) This topic worth a lot more time to investigate but I've got a solution that fits my needs.
About the methods:
- vector of bool: the BitVector solution from Dieter Lücking
- binary lookup: the method suggested by Tony D
- unordered set: putting simply all elements into an std::unordered_set, then ask for the number of its elements, as suggested by Ixanezis
- vector insert sorted: using Dieter Lücking's Sorted Vector approach
- set insert: the method I've described in the question form
- radix sort: Ixanezis's suggestion, using a popular sorting algorithm on vector
- set emplace hint: using std::emplace_hint as described in the question form