8

I'm confused with java interview question about which hashcode implementation is better. We have a class Point {int x, y; }. Why does implementation of hashcode 31 * x + y for this class is better than x + y ? The right answer is "The multiplier creates a dependence of the hashcode value on the order of processing the fields, which ultimately gives rise to a better hash function". But I can't understand why the order of processing is the point here because the entire expression 31 * x + y is calculating when I executes point1.equals(point2); And there is no matter in which order it happens. Am I wrong?

200OK
  • 701
  • 5
  • 14
  • 1
    If you happen to use Point as a key in map/hashtable, having `hashCode() {return x + y;}` would create redundant hash collision for point (1,2) and point (2,1), etc. – Nowhere Man Apr 20 '20 at 07:59
  • 1
    I think the amount of collisions will remain the same. With `hashCode() {return 31 * x + y;}`, pairs (0,31) and (1,0) will have a collision as well. – Amongalen Apr 20 '20 at 08:03
  • @Amongalen It's always _possible_ to have collisions, but in real-world usage the values 0 and 1 will be found much more frequently than 31. – chrylis -cautiouslyoptimistic- Apr 20 '20 at 08:06
  • @Amongalen The is no perfect hash, and all have collisions. But the function used may lead to less (or more) collision in usual cases. `x+y` would introduce collisions by symmetry (a very usual property in geometry). – Jean-Baptiste Yunès Apr 20 '20 at 08:07
  • After searching a bit I've found a more interesting answer in [here](https://stackoverflow.com/a/299748/4949750) - a quote from Effective Java. Worth checking out I think. – Amongalen Apr 20 '20 at 08:10
  • 1
    @Amongalen your link doesn't explain why it is better, just that it is used in Java. 31 has been chosen empirically so that the tradeoff between collisions and effective computation. The general method has been studied by Knuth, using polynomials and powers of primes. – Jean-Baptiste Yunès Apr 20 '20 at 08:23

3 Answers3

8

If you use x+y then how to distinguish points (3,4) and (4,3)? Both will have the same hashcode...

Now while 31 * x + y will not be perfect, in the same case, it will be much much better.

Note: by definition of hashing there is no perfect hashing. The only thing is to analyse what kind of collisions occurs for a given hash function. In the geometrical case the first one introduces collisions for a very simple and usual symmetry property. Thus in very common cases there could be too much collisions.

Jean-Baptiste Yunès
  • 34,548
  • 4
  • 48
  • 69
  • In the first case, (3,4) and (4,3) have the same hashcode. In the second case, (0,31) and (1,0) have the same hashcode. Why second is better than the first? – Amongalen Apr 20 '20 at 08:05
  • @Amongalen because the geometrical property induced by the equivalence function is much more unusual in the second case. – Jean-Baptiste Yunès Apr 20 '20 at 08:08
  • @Amongalen, this can be fixed by also multiplying the second item, e.g.: `31 * x + 37 * y` – Titulum Apr 20 '20 at 10:07
  • 2
    @Titulum and then some other pair of points will collide with each other, eg. (0,31) and (37, 0) :) – Amongalen Apr 20 '20 at 10:16
  • @Titulum No way. Coding two integers into one ===> collision, whatever you do. – Jean-Baptiste Yunès Apr 20 '20 at 11:09
  • @Amongalen, `Now while 31 * x + y will not be perfect, in the same case, it will be much much better.`. It's all about reducing the probability of collisions. I've added an answer with a small experiment. – Titulum Apr 20 '20 at 11:42
6

Imagine you have two string properties prop1 and prop2, and two objects:

A: {prop1="foo", prop2="bar"}
B: {prop1="bar", prop2="foo"}

These are clearly different values, and it's useful to set up the hash code to distinguish between them. If you simply add the properties' hash codes together, you'll get the same value for both A and B. Instead, by multiplying and adding, the hash code will be different based on the property sequence.

It seems you may be slightly misinterpreting the advice: The purpose of the multiply-and-add is to create a dependence on the semantic order of properties within an object, not the execution order of the calculation.

chrylis -cautiouslyoptimistic-
  • 75,269
  • 21
  • 115
  • 152
  • >>> "foo" * 37 + "bar" 'foofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoobar' >>> – Marichyasana Apr 20 '20 at 21:18
0

Jean-Baptiste Yunès' answer is the correct one, but I will add the following example to illustrate (keep in mind that it's in javascript, only because I quickly implemented this for the example):

class Point {
    constructor(x, y) {
        this.x = x;
        this.y = y;
    }
}

function getHashCollisions(collection, hashFunction) {
    const collisionMap = new Map();
    let count = 1;
    let total = collection.length;
    for (const point of collection) {
        console.log(`calculating ${count++}/${total}`);
        const currentHash = hashFunction(point);
        const hashCount = collisionMap.has(currentHash) ? collisionMap.get(currentHash) +1 : 1;
        collisionMap.set(currentHash, hashCount);
    }
    return collisionMap;
}

function generateDataset(rangeX, rangeY) {
    const points = [];
    let count = 1;
    for (let x = 0; x < rangeX; x++) {
        for (let y = 0; y < rangeY; y++) {
            console.log(`generating ${count++} Point(${x};${y})`);
            points.push(new Point(x, y));
        }
    }
    return points;
}

function calculateAndGenerateReport(dataset, hashFunction, hashFunctionName) {
    const hashes = getHashCollisions(dataset, hashFunction);
    const totalCollisions = Array.from(hashes.values()).filter(currentCollisionCount => currentCollisionCount > 1).length;
    const highestCollisionCount = Array.from(hashes.values()).reduce((currentHighest, current) => current > currentHighest ? current : currentHighest) - 1;
    return `${hashFunctionName}: ${totalCollisions} collisions, highest collision count: ${highestCollisionCount}`;
}

const dataset = generateDataset(100, 100);

const literalHashesReport = calculateAndGenerateReport(dataset, point => point.x + point.y, "literal hash function:");
const onePrimeHashesReport = calculateAndGenerateReport(dataset, point => 31 * point.x + point.y, "one prime multiplication hash function:");
const twoPrimesHashesReport = calculateAndGenerateReport(dataset, point => 31 * point.x + 37 * point.y, "two primes multiplication hash function:");
const twoLargePrimesHashesReport = calculateAndGenerateReport(dataset, point => 8191 * point.x + 131071 * point.y, "two large primes multiplication hash function:");

console.log(literalHashesReport);
console.log(onePrimeHashesReport);
console.log(twoPrimesHashesReport);
console.log(twoLargePrimesHashesReport)

Result:

literal hash function: 197 collisions, highest collision count: 99
one prime multiplication hash function: 3107 collisions, highest collision count: 3
two primes multiplication hash function: 3359 collisions, highest collision count: 2
two large primes multiplication hash function: 0 collisions, highest collision count: 0

This shows that the (prime) numbers we choose to "calculate" the hash greatly decreases the probability for collisions.

Titulum
  • 9,928
  • 11
  • 41
  • 79