3

I am receiving a stream of unordered Int32 values and need to track the count of distinct values that I receive.

My thought is to add the Int32 values into a HashSet<Int32>. Duplicate entries will simply not be added per the behavior of HashSet.

Do I understand correctly that set membership is based on GetHashCode() and that the hash code of an Int32 is the number itself?

Is there an approach that is either more CPU or more memory efficient?

UPDATE

The data stream is rather large. Simply using Linq to iterate the stream to get the distinct count is not what I'm after, since that would involve iterating the stream a second time.

Eric J.
  • 147,927
  • 63
  • 340
  • 553
  • 1
    HashCode of an Int32 is the Int - yes, absolutely! See here http://stackoverflow.com/questions/3893782/how-is-gethashcode-implemented-for-int32 Additional question; do you receive all of the values at once, or in bursts over time? – dash Jun 27 '12 at 22:09
  • 2
    Set membership (for HashSet) is based on the hash code *and* equality, in general. For Int32 that's the same thing, but for most types it isn't. – Jon Skeet Jun 27 '12 at 22:10
  • The hashcode of an int is indeed the value itself, but it's irrelevant, since the values are compared anyway. – Thomas Levesque Jun 27 '12 at 22:11
  • @JonSkeet: Help me understand *and equality*. Do you mean the Equals() method? Reference equality? – Eric J. Jun 27 '12 at 22:18
  • @EricJ.: Equals. Although you can often provide an IEqualityComparer which would be used for both GetHashCode and Equals, rather than the one provided by the object itself. – Jon Skeet Jun 28 '12 at 05:59
  • @EricJ.: It has to be both `Equals` and `GetHashCode` because two unequal objects can have a hash code which is equal but the instance themselves aren't equal. While this is not true for Int32, it is true, for example, for Int64. In this case, `GetHashCode` just gets you into the right bucket, then you can search within the relatively smaller bucket if the item you are testing for equality is there. – codekaizen Jun 29 '12 at 03:06

5 Answers5

4

Assuming you have some sort of IEnumerable<int> you can do the following:

int count = stream.Distinct().Count();

Do I understand correctly that set membership is based on GetHashCode()

Not quite. Membership in a HashSet is based on a combination of GetHashCode and an equality check. In general, two objects can have the same hashcode but not be equal. Though for int that cannot happen.

and that the hash code of an Int32 is the number itself?

Yes, that is correct.

Is there an approach that is either more CPU or more memory efficient?

If you know that your ints will be in a small range, you can efficiently store which you have seen by using a bitmap. For example, if you have a range of 1,000,000 you can store which ints you have seen in 1,000,000 bits. A bit set to 1 at index n means that you have seen the integer n. Here's some example code showing one way to implement this:

void Main()
{
    int max = 1000000;

    IEnumerable<int> stream = GetStream(max);

    int count = DistinctCount(stream, max);
    int count2 = stream.Distinct().Count();
    Debug.Assert(count == count2);
}

int DistinctCount(IEnumerable<int> stream, int max)
{
    int[] seen = new int[max / 32];
    foreach (int x in stream)
    {
        seen[x / 32] |= 1 << (x % 32);
    }

    int count = 0;
    foreach (uint s in seen)
    {
        uint t = s;
        while (t > 0)
        {
            if (t % 2 == 1) { count++; }
            t /= 2;
        }
    }
    return count;
}

IEnumerable<int> GetStream(int max)
{
    List<int> stream = new List<int>();
    Random random = new Random();
    for (int i = 0; i < 2000000; ++i)
    {
        stream.Add(random.Next(max));
    }
    return stream;
}
Mark Byers
  • 811,555
  • 193
  • 1,581
  • 1,452
  • It seems rather inefficient to enumerate the (very large) stream just to get the distinct count, as I'm already enumerating it for other processing. That is why my question is centered on how to efficiently implement the distinct count myself. – Eric J. Jun 27 '12 at 22:16
  • 1
    @EricJ I have to agree; your original `HashSet<>` is just as reasonable. – Marc Gravell Jun 27 '12 at 22:24
1

One thought, if you have a very large stream of data (millions to billions) is to use a Bloom filter. This will provide you with an ability to determine an approximate count as you stream the data, and if you have the need for an exact count, you can process it offline.

A reasonable C# implementation is here: http://bloomfilter.codeplex.com/

codekaizen
  • 26,990
  • 7
  • 84
  • 140
  • Good suggestion, but I need an exact count the first time though (+1 though...) – Eric J. Jun 30 '12 at 14:57
  • You can still get an exact count, but do it slower (any time after the approximate count). There is a physical limitation to using a hashset to contain integers - around 50 million of them (or fewer) a hashset becomes too large and memory becomes an issue. You'd have to employ some other strategy for an exact count at this point, and they are all slower or subject to similar memory constraints. I'd question the hard need for an exact count right away... often an approximation is valuable, even if it is a decision-making number, and the exact number can trail by seconds or minutes. – codekaizen Jun 30 '12 at 17:29
1

Don't really know your domain, but there are some algorithms to calculate cardinality of large sets using very small memory and processing.

I'm using HyperLogLog in a project of mine. I use it to count several million of distinct items using as low as 8KB of memory with 1% error.

Here is a paper describing it:

http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf

I've implemented it in Java and Python. The Python version is opensource and the algorithm is rather small. Check it out:

https://github.com/juanplopes/hyperloglog.py/blob/master/hyperloglog.py

Juan Lopes
  • 10,143
  • 2
  • 25
  • 44
0

I assume that you receive the values in chunks, be it one int at a time to a bunch of ints.

given that, the simplest thing is probably the best, I'd use a hash too. However I don't see how you can use a HashSet. If you want the count of distinct values, you'd only get the found values

Dictionary<int,int> _countHash = new Dictionary<int,int>();
void moreIntsArrived(IEnumerable<int> bunch)
{
   foreach(var value in bunch)
   {
       if (_countHash.ContainsKey(value))
       {
             _countHash[value] += _countHash[value];
       }
       else
       {
             _countHash[value] = 0;
       }
   }
}

However, do what Mr Hansleman suggests, measure it

There is probably a trade off between doing the ContainsKey check and just take the hit of the exception when the key is not found, IF your stream is large enough to stop getting new unique values

void moreIntsArrived(IEnumerable<int> bunch)
{
   foreach(var value in bunch)
   {
       try
       {
            int c = _countHash[value];
             _countHash[value] = c + 1;
       }
       catch(KeyNotFoundException)
       {
             _countHash[value] = 0;
       }
   }
}

Then again there is the Dictionary::TryGetValue() method but it depends what that does inside :-) Use the Source

Adam Straughan
  • 2,766
  • 3
  • 20
  • 27
  • 1
    HashSet adds a new item if and only if it's not already a member of the set. It's like a dictionary, but without unneeded values. – Eric J. Jun 27 '12 at 22:39
  • but if you want the count? where is that stored? myhash.Add(8), myHash.Add(8), how do I discover that there are 2 instances of 8? I don't see that in the docs (or I misunderstood the question = "I am receiving a stream of unordered Int32 values and need to track the count of distinct values that I receive.") – Adam Straughan Jun 27 '12 at 22:57
  • I don't need to know how many 2's, 6's and 42's, just the total number of distinct integers. If the data ream were 6 6 2 6 42 2 42, the answer would be "3". – Eric J. Jun 27 '12 at 23:37
  • Ah. A little bit of http://specificationbyexample.com/ goes a long way. My answer is for a different question :-), stick with HashSet as you already said. – Adam Straughan Jun 28 '12 at 06:44
0

I appreciate the other answers, but find that the original approach of using a HashSet<T> is most appropriate for my situation.

It is not efficient to re-iterate the stream to get the distinct count.

Eric J.
  • 147,927
  • 63
  • 340
  • 553