4

I have data in the form of:

ID   ATTR
3    10
1    20
1    20
4    30
...  ...

Where ID and Attr are unsorted and may contain duplicates. The range for the IDs are 1-20,000 or so, and ATTR are unsigned int. There may be anywhere between 100,000 and 500,000 pairs that I need to process at a single time.

I am looking for:

  1. The number of unique pairs.
  2. The number of times a non-unique pair pops up.

So in the above data, I'd want to know that (1,20) appeared twice and that there were 3 unique pairs.

I'm currently using a hash table in my naive approach. I keep a counter of unique pairs, and decrement the counter if the item I am inserting is already there. I also keep an array of IDs of the non-unique pairs. (All on first encounters)

Performance and size are about equal concerns. I'm actually OK with a relatively high (say 0.5%) rate of false positives given the performance and size concerns. (I've also implemented this using a spectral bloom)

I'm not that smart, so I'm sure there's a better solution out there, and I'd like to hear about your favorite hash table implementations/any other ideas. :)

user962158
  • 374
  • 1
  • 8

1 Answers1

2

A hash table with keys like <id>=<attr> is an excellent solution to this problem. If you can tolerate errors, you can get smaller/faster with a bloom, I guess. But do you really need to do that?

Mike Sokolov
  • 6,914
  • 2
  • 23
  • 31
  • As a lowly student intern, that question is waaaay above my pay grade. ;) It's good to know I wasn't out in left field with the hash table. Thanks! – user962158 Sep 26 '11 at 18:11