0

I have a histogram that is a vector/list of numbers. What is an easy and efficient algorithm for obtaining a hashcode of such a histogram? The hash code just needs to split the images on the hash value and not to compare images.

This application has no concerns on security, so cryptographic functions are unnecessarily slow.

Patrick M
  • 10,547
  • 9
  • 68
  • 101
  • Your question is pretty vague. Is your histogram just a list or multidimensional array of numbers? What's wrong with multiplying each number by a prime and adding it to a running total that allows overflow? That's the most basic hash approach. – Patrick M Mar 24 '15 at 19:33
  • Just a list ! That approach causes overflow and the hash values are negative ! I wonder if that is inefficient ! (novice) Sorry for the vague question. pretty confused and out of time And the histogram holds 16384 bins ! – WizzyVid Mar 24 '15 at 19:42

2 Answers2

0

The way to hash a list is to combine the hashes for each item. Java implements the hash function for a list like so:

public int hashCode() {
    int hashCode = 1;
    for (E e : this)
        hashCode = 31*hashCode + (e==null ? 0 : e.hashCode());
    return hashCode;
}

Notable properties:

  1. The hash code for every empty list is exactly 1.
  2. The hash code for lists with different numbers of elements are very likely to be different.
  3. The hash code for lists with the same number of elements is more likely to collide. Lists with the same elements in different order will collide; the hash code for lists [1,2] and [2,1] are unfortunately identical.
  4. This is a big of a drawback, but not as big as you might think at first. Hash tables implement a fallback where it checks for hashcode equality first and total equality second. If the difference in ordering occurs near the front of the lists, this fallback check is quick. At worst case, it only takes a number of comparisons equal to the length of the lists.

All in all, this would be a pretty good hash function for your use case, even if you use the numeric value of each histogram entry for its hash code. The problem you really want to avoid with hash functions is common-divisibility, meaning you want outputs from your hash function to fall into different buckets of a hash table. The Wikipedia article covers the properties of a good hash function if you want more information.

To obtain a better hash code for a list of numbers, we should look at a better hash code for an individual number, specifically this answer.

unsigned int hash(unsigned int[] list) {
    unsigned int hashCode = 0;
    for (int i = 1; i < list.length; i++) {
        hashCode = hashCode + list[i];
        hashCode = ((hashCode >> 16) ^ hashCode) * 0x45d9f3b;
        hashCode = ((hashCode >> 16) ^ hashCode) * 0x45d9f3b;
        hashCode = ((hashCode >> 16) ^ hashCode);
    }
    return hashCode;
}

I think that's a good adaptation, but I'm not an expect.

Regarding the efficiency of overflow, it's not a major slowdown unless you have to handle exceptions for it. In Java, arithmetic will never throw an overflow exception, instead just wrapping it around to the min or max value. There is no real drawback to having a negative hashcode, as long as your implementation of a hashtable supports it.

Community
  • 1
  • 1
Patrick M
  • 10,547
  • 9
  • 68
  • 101
0

I may not be understanding your problem correctly, but hashmaps already in matlab, they just have a different name

containers.maps

andrew
  • 2,451
  • 1
  • 15
  • 22