0

I'm using a hash table (DotNET Dictionary object) as part of a sparse two-dimensional data set. Most of the entries in the hash table will be close together. I'll probably end up with 100 ~ 10,000 entries, all of them clustered near zero. I've read that a Hash table performs better when the hashes are spread across the entire integer(32 bit) range.

Is there a cheap way to map consecutive integers onto wildly different values in a 1:1 fashion? I don't have to map them back, it's purely a one-way thing.

David Rutten
  • 4,716
  • 6
  • 43
  • 72
  • 1
    First of all the real performance killer is not an issue using dictionary. The real killer is when youn end up with a table where multiple objects have the same key but that's not an option with dictionary. Further more it's not having your objects scatterred over some arbitrary set of key vallues that counts. a set like 1,2,3,4 will potentially use less memmory than 1 1024 1089999 2^32-1 – Rune FS Sep 19 '09 at 05:34
  • To improve performance of Dictionary in .NET, you have to balance collision rate and speed of hashing. To have a perfect hash without collision it will be more time consuming. Similarly the fastest hashing algorithm will have more collisions. Finding the balance is the key, and in that respect, BCL team would have done their credible job well, so just rely on it unless you have performance problems.. – nawfal Dec 15 '12 at 06:58

3 Answers3

3

Maybe I'm misunderstanding what you are saying, but Dictionary will already hash your integers. There shouldn't be any need to pre-hash them. Why not try out the default implementation and see how it goes instead of attempting a pre-optimizion that will in all likelihood be pointless.

dan-gph
  • 16,301
  • 12
  • 61
  • 79
  • If you disassemble the Int32 type, the hash code is merely the number itself. Hash tables work better if the hash values are spread across the entire range. You're right of course, I should try both and see if it makes a difference at all, but for me to try it both ways, I need a way to remap the integers. – David Rutten Sep 19 '09 at 19:39
  • @David, fair enough. There are some hash functions here: http://stackoverflow.com/questions/664014/what-integer-hash-function-are-good-that-accepts-an-integer-hash-key – dan-gph Sep 20 '09 at 01:29
1

If you know the maximum value of your keyset (kmax), you could expand by a constant factor (multiplier), say multiply by a fixed prime number that keeps the product below the max integer size (2^31 - 1):

i.e. the nearest prime number to (2^30) / kmax

Note: make sure the prime used is not the same as the number of buckets in the Hash table.

Here is another solution: Since the .NET Random class will generate the same value for the same seed, you could use that to distribute the incoming keys.

Mitch Wheat
  • 295,962
  • 43
  • 465
  • 541
  • Interesting solution. I can be reasonably sure that integers will remain low, but this class is going into an SDK so I hesitate to make this a hard constraint. – David Rutten Sep 19 '09 at 05:37
1

Instead of using Integer, write a class that Inherits from Integer, and override the GetHashCode function. This way you don't have to do anything but create this function!

The easiest way I can think of to spread out the values evenly is to do something like:

public class MyInteger:Integer
{
    public override int GetHashCode()
    {
       unchecked
       {
           return (int)Math.Pow(this,this);
       }
    }
}

Nice and evenly split up, while keeping the effort to a minimum.

Mitch Wheat
  • 295,962
  • 43
  • 465
  • 541
Erich
  • 657
  • 5
  • 6
  • @Erich, thanks, but is this guaranteed to provide a unique mapping of every single possible integer? – David Rutten Sep 19 '09 at 05:30
  • 1
    It won't, but assuming they are reasonably clustered they will be. Also, you don't NEED a unique mapping the way a hashtable works. They will be spread out enough that your speed will be quite fast. Other than the range, you might be better off just using an Array, with an int-index. Each empty spot only takes 4 bytes, so it won't be terribly large. You mention it is 10,000 as the highest value, so that is only 40,000 bytes, or 40k for the whole array, which will have an O(1) search and insert time. if you are willing to give up a k of memory, it would be best done that way. – Erich Sep 19 '09 at 05:33
  • Good point, I don't mind spending a lot of memory on this since it will only need that memory very briefly. I'll try this approach as well and see if it outperforms the hash-table way. Thanks! – David Rutten Sep 19 '09 at 19:42
  • What is an Integer class in .NET?? :o – nawfal Dec 15 '12 at 06:53