14

I have a java applciation in which I want to generate long ids for strings (in order to store those strings in neo4j). In order to avoid data duplication, I would like to generate an id for each string stored in a long integer, which should be unique for each string. How can I do that ?

Riduidel
  • 22,052
  • 14
  • 85
  • 185
  • Couldn't you just get the hash of the Strings and cast them to long before storing in neo? – Marthin Feb 16 '12 at 10:38
  • 5
    You cannot achieve "unique for all strings" - long has 64 bits, a string of length 9 has 72 bits, there got to be some strings which will be hashed to the same long – amit Feb 16 '12 at 10:38
  • 1
    You can't get uniqueness, since there are infintitely many strings and only finitely many longs. Can you describe more specifically what you're looking for? – templatetypedef Feb 16 '12 at 10:39

5 Answers5

25

This code will calculate pretty good hash:

String s = "some string";
long hash = UUID.nameUUIDFromBytes(s.getBytes()).getMostSignificantBits();
Ran
  • 462
  • 7
  • 15
7

Why don't you have a look a the hashcode() function of String, and just adopt it to using long values instead?

Btw. if there was a way to create a unique ID for each String, then you would have found a compression algorithm that would be able to pack every String into 8 bytes (not possible by definition).

Daniel
  • 27,718
  • 20
  • 89
  • 133
5

long has 64 bits. A String of length 9 has 72 bits. from pigeon hole principle - you cannot get a unique hashing for 9 chars long strings to a long.

If you still want a long hash: You can just take two standard [different!] hash functions for String->int, hash1() and hash2() and calculate: hash(s) = 2^32* hash1(s) + hash2(s)

amit
  • 175,853
  • 27
  • 231
  • 333
  • 1
    A character in java is UTC-16 thus a `String` of length 9 is 144 bits. Also, using two hash functions make so sense. What you want is simply a hashing algorithm that can operate on data of variable length such as MD5 for instance. For sure there will be collisions, but it will remain close to as minimal as it can be with 64 bits of cardinality. – Olivier Giniaux May 25 '22 at 12:19
2

A simple 64 bits hash can be implement by combining CRC32 with Adler32.

Example in Java:

package com.example;

import java.util.zip.Adler32;
import java.util.zip.CRC32;

public class MySimpleHash {

    /**
     * Calculate a 64 bits hash by combining CRC32 with Adler32.
     * 
     * @param bytes a byte array
     * @return a hash number
     */
    public static long getHash(byte[] bytes) {

        CRC32 crc32 = new CRC32();
        Adler32 adl32 = new Adler32();

        crc32.update(bytes);
        adl32.update(bytes);

        long crc = crc32.getValue();
        long adl = adl32.getValue();

        return (crc << 32) | adl;
    }

    public static void main(String[] args) {
        String string = "This is a test string";
        long hash = getHash(string.getBytes());
        System.out.println("output: " + hash);
    }
}
output: 7732385261082445741

Example in Python:

#!/usr/bin/python3

import zlib

def get_hash(bytes):
    return zlib.crc32(bytes) << 32 | zlib.adler32(bytes)

string = "This is a test string"
hash = get_hash(string.encode())
print("output:", hash)
output: 7732385261082445741

This Gist compares some hash methods: https://gist.github.com/fabiolimace/507eac3d35900050eeb9772e5b1871ba

fabiolimace
  • 972
  • 11
  • 13
1

There are many answers, try the following:

Or, as suggested before, check out the sources.

PS. One more technique is to maintain a dictionary of strings: since you're unlikely to get 264 strings any time soon, you can have perfect mapping. Note though that that mapping may as well become a major bottleneck.

alf
  • 8,377
  • 24
  • 45