Counting number of unique pairs and instances of non-unique pairs in unsorted data

Question

I have data in the form of:

ID   ATTR
3    10
1    20
1    20
4    30
...  ...

Where ID and Attr are unsorted and may contain duplicates. The range for the IDs are 1-20,000 or so, and ATTR are unsigned int. There may be anywhere between 100,000 and 500,000 pairs that I need to process at a single time.

I am looking for:

The number of unique pairs.
The number of times a non-unique pair pops up.

So in the above data, I'd want to know that (1,20) appeared twice and that there were 3 unique pairs.

I'm currently using a hash table in my naive approach. I keep a counter of unique pairs, and decrement the counter if the item I am inserting is already there. I also keep an array of IDs of the non-unique pairs. (All on first encounters)

Performance and size are about equal concerns. I'm actually OK with a relatively high (say 0.5%) rate of false positives given the performance and size concerns. (I've also implemented this using a spectral bloom)

I'm not that smart, so I'm sure there's a better solution out there, and I'd like to hear about your favorite hash table implementations/any other ideas. :)

score 2 · Accepted Answer · answered Sep 26 '11 at 01:08

2

A hash table with keys like <id>=<attr> is an excellent solution to this problem. If you can tolerate errors, you can get smaller/faster with a bloom, I guess. But do you really need to do that?

answered Sep 26 '11 at 01:08

Mike Sokolov

6,914
2
23
31

As a lowly student intern, that question is waaaay above my pay grade. ;) It's good to know I wasn't out in left field with the hash table. Thanks! – user962158 Sep 26 '11 at 18:11

Counting number of unique pairs and instances of non-unique pairs in unsorted data

1 Answers1

Linked