1

Given a list of numbers, for example some unique integer or long ID's what would be an optimum way to compute a reproducible 'signature' (preferably irregardless of element order)?

The use case is to detect whether any of the IDs have been added or removed from a list (of objects).

Java's array.hashCode() does not fit the bill, because even if it is apparently consistent between JVM invocations it returns a different hash if the order of elements changes or if another instance with the same elements is created:

int[] ids1 = {1, 2, 3};
System.out.println(ids1.hashCode());
// output: 980546781

int[] ids1Copy = {1, 2, 3};
System.out.println(ids1Copy.hashCode());
// output: 2061475679

int[] ids2 = {2, 1, 3};
System.out.println(ids2.hashCode());
// output: 140435067

My understanding is that ids1.hashCode() computes the hash for the memory address of the array and not a cumulative hash code for the primitive elements in the array.

What other approaches could be used in this case apart from hashing each element separately?

Anmol Singh Jaggi
  • 8,376
  • 4
  • 36
  • 77
ccpizza
  • 28,968
  • 18
  • 162
  • 169
  • 1
    A _unique_ signature? Unless there are some constraints on your numbers that you're not mentioning, I don't see how that is possible. The domain of _lists of numbers_ has more values than the domain of ints or longs that you want to use for the signature. – khelwood Aug 10 '20 at 18:58
  • 1
    [`Arrays.deepHashCode()`](https://docs.oracle.com/javase/7/docs/api/java/util/Arrays.html#deepHashCode(java.lang.Object[])) – daniu Aug 10 '20 at 18:58
  • 1
    If you don't care about the order you might want to use a `HashSet`'s `hashCode()` method. – Progman Aug 10 '20 at 19:01
  • @daniu: `deepHashCode()` is almost the answer, except that it changes if the order is not the same so would need to be combined with sorting to be order-independent. – ccpizza Aug 10 '20 at 19:03
  • @ccpizza What do you mean by "optimum" in your case/situation? – Progman Aug 10 '20 at 19:04
  • @Progman: by optimum I mean without doing too much CPU work, i.e. fast enough for big lists. – ccpizza Aug 10 '20 at 19:06
  • Using HashSet is wrong as hashcode of {1, 2, 3} and {1, 2, 3, 3} are same. Infact, its impossible to represent {1, 2, 3, 3} as a HashSet since duplicates are not allowed. If Java had a multiset like [Guava's](https://guava.dev/releases/18.0/api/docs/com/google/common/collect/Multiset.html), then it would be apt for this. – Anmol Singh Jaggi Aug 10 '20 at 19:11
  • @Progman: `HashSet` does exactly what is needed as there are no duplicate elements; could you please add it as an answer so that I can accept it? – ccpizza Aug 10 '20 at 19:12
  • `Arrays.deepHashCode()` is incorrect as well since the result {1, 2, 3} and {2, 3, 1} are different. – Anmol Singh Jaggi Aug 10 '20 at 19:14
  • @AnmolSinghJaggi: `HashSet` serves the purpose since I am dealing with object ID's so there is no chance there will be duplicates (unless there is a bug somewhere). – ccpizza Aug 10 '20 at 19:15
  • 1
    @ccpizza Again as khelwood suggested, 2 hashsets might return the same hashcode even though they are different. The only correct answer is converting them to hashmap first. – Anmol Singh Jaggi Aug 10 '20 at 19:16
  • @AnmolSinghJaggi: for non-unique elements `Arrays.deepHashCode()` + sorting is a viable answer. – ccpizza Aug 10 '20 at 19:16
  • @ccpizza No, again hashcode is not the answer for this, no matter what data structure you use. 2 hashcodes can be same even if the 2 data structures in question are different. Its possible that [1, 2, 3] and [4, 5, 6] return the same hashcode, depending on the hash function being used. Thats why comparing hashcode to check equality is unreliable. That is why every class in Java has to override `equals()` and `hashcode()` together. – Anmol Singh Jaggi Aug 10 '20 at 19:17

5 Answers5

1

You can first create a hashmap of number vs its count in the array. Then you can just use the hashcode of the hashmap.

However, keep in mind that it might be possible (although rare) for 2 different hashmaps to return the same hashcode, as @khelwood suggested.

So if you want to reliably check if 2 lists of of numbers are same or not, you can create their frequency hashmaps as mentioned above, and then just do these checks:

  • hashmap2.size() == hashmap1.size()
  • for every (key, value) in hashmap2 { hashmap1[key] == value }

Its algorithmic time complexity is as efficient as computing and comparing hashcodes.

EDIT:

I just realized the above mentioned algorithm is what's used internally in Java HashMap equals().

So we can just create the frequency hashmaps and just check their equality using hashmap2.equals(hashmap1).

EDIT 2:

If all the numbers in an array are distinct, then you can create a hashset from them and then just check if set2.equals(set1).

Anmol Singh Jaggi
  • 8,376
  • 4
  • 36
  • 77
1

The constraint

a reproducible 'signature' (preferably irregardless of element order)

makes this problem challenging.

Here are two approaches off the top of my head:

Approach 1:

a. Sort your list of integers in O(N lg N) time.

b. Treat your list of integers as the digits in a base-M integer, where M is the largest number in your list. Suppose you have a list of integers like [A, B, C]. Then you can hash that list to be: hash = A*M^0 + B*M^1 + C*M^2. This approach is reasonable if M is a small value. You can alternatively choose a small M as a power of 2 (e.g. 2^8) and then for any integer larger than that, break up the integer into chunks of 8 bits and use the same algorithm.

Total time: O(N lg N) + O(N). Space: O(1) long int accumulator.

Approach 2:

a. Sort your list of integers in O(N lg N) time.

b. Build a string representation of your list of integer and then hash the string. For example, for a list of integers like [1, 2, 3], create a string 1_2_3 and hash it.

Total time: O(N lg N) + O(N). Space: O(N lg N) sized string.

stackoverflowuser2010
  • 38,621
  • 48
  • 169
  • 217
  • 1
    For ints/longs (or objects overriding `hashCode()`) once you sort the list it's possible to use `Arrays.deepHashCode() `. If the elements are unique and sorting insensitivity is not needed then `HashSet.hashCode()` might work. – ccpizza Aug 10 '20 at 20:53
  • 2
    Its time complexity is O(nlogn) which is slower than Anmol's answer which has O(n) complexity. Also in the first approach, one could easily face integer overflow error. In your second approach, you are using hashcode which is unreliable as mentioned in the other 2 answers by btilly and Anmol – Raunak Kukreja Aug 11 '20 at 14:24
  • for approach 2-b it's possible to use `Arrays.toString()` or `List.toString()` (for implementations based on `java.util.AbstractCollection`) – ccpizza Aug 11 '20 at 15:15
  • @RaunakKukreja: For 1, I mentioned that you'd use a long integer, and if that's not big enough for you, use a BigInteger. For 2, there cannot be a hash collision if all the numbers are represented as a string with delimiters. – stackoverflowuser2010 Aug 11 '20 at 17:50
0

Note that all hash based solutions are unreliable. That is, there is a chance of a collision.

Assuming that that is OK, here is a simple approach.

First, build a hash function for pairs of integers. There are lots of those available.

Next, let's do a thought exercise.

Imagine arranging all of your integers into 2^64 buckets. Then look at the counts. So an array like [2, 0, 2] becomes a list of frequency counts like, ..., 0, 0 0, 1, 0, 2, 0, 0, 0, ....

Now pair those frequency counts up with their next neighbor. So we get ..., (0, 0), (1, 0), (2, 0), (0, 0), .... Now replace each pair with its hash. Repeat. After 64 levels, we'll get a single hash representing the whole frequency count.

Now we can't actually perform this operation. However at each level most of the entries start off 0, then hash(0, 0), then hash(hash(0,0), hash(0,0)) and so on. Which are all the same. So if the data structure is a linked list with the value and two pointers, most of the pointers will just point at the generic 0-filled block data structure.

So we can write out a "tree" with all of the pointers for 0-blocks pointing at the same canonical values. And when we have this tree, inserting an element is a question of navigating the path down to the appropriate root, creating a new node with the right value, and walking back up the tree inserting new values. This takes O(64) time to do. Insert all of the values, and we get a representation of the exact frequency count of the values, signed with a hash, in time O(64 n). (Creating the same amount of data, and then being able to throw most of it away.)

But it gets better. If you have two lists with this data structure created, not only can you tell whether they are likely different, but you can actually find the differences! (The rsync utility uses a similar trick to figure out just what changed between remote files so that it can restrict how much gets copied.)

btilly
  • 43,296
  • 3
  • 59
  • 88
0

Based on the comments and feedback the following approaches have been identified (possibly unreliable because of potential hash collisions as outlined by btilly):

import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class NumberHash {

    public static void main(String[] args) {

        // ######## Arrays.deepHashCode() ########

        Integer[] ids1Sorted = {1, 2, 3};
        Integer[] ids1Unsorted = {3, 1, 2};
        
        System.out.println(Arrays.deepHashCode(ids1Sorted));
        // 30817

        Arrays.sort(ids1Unsorted);
        System.out.println(Arrays.deepHashCode(ids1Unsorted));
        // 30817


        // ######## toString() based ########

        int[] idsSorted = {1, 2, 3};
        System.out.println(Arrays.toString(idsSorted).hashCode());
        // -412129978

        int[] idsUnsorted = {3, 2, 1};
        Arrays.sort(idsUnsorted);
        System.out.println(Arrays.toString(idsUnsorted).hashCode());
        // -412129978

        List<Integer> oids = Arrays.asList(2, 3, 1);
        Collections.sort(oids);
        System.out.println(oids.toString().hashCode());
        // -412129978
    }
}
ccpizza
  • 28,968
  • 18
  • 162
  • 169
0

I would take a checksum like the CRC32 or Adler32 as unique identifier
wrapped in a lambda ready for use:

int[] yourArray = {1, 2, 3};
long checksum = Arrays.stream(yourArray).boxed().collect(Collector.of(
    CRC32::new, CRC32::update, (l, r) -> {return l;})).getValue();

{1, 2, 3}: 0x55bc801d
{1, 3, 2}: 0x3ba081ca
{2, 1, 3}: 0x7cd76d87

Kaplan
  • 2,572
  • 13
  • 14