13

We have an app that the Python module will write data to redis shards and the Java module will read data from redis shards, so I need to implement the exact same consistent hashing algorithm for Java and Python to make sure the data can be found.

I googled around and tried several implementations, but found the Java and Python implementations are always different, can't be used togather. Need your help.

Edit, online implementations I have tried:
Java: http://weblogs.java.net/blog/tomwhite/archive/2007/11/consistent_hash.html
Python: http://techspot.zzzeek.org/2012/07/07/the-absolutely-simplest-consistent-hashing-example/
http://amix.dk/blog/post/19367

Edit, attached Java (Google Guava lib used) and Python code I wrote. Code are based on the above articles.

import java.util.Collection;
import java.util.SortedMap;
import java.util.TreeMap;
import com.google.common.hash.HashFunction;

public class ConsistentHash<T> {
    private final HashFunction hashFunction;
    private final int numberOfReplicas;
    private final SortedMap<Long, T> circle = new TreeMap<Long, T>();

    public ConsistentHash(HashFunction hashFunction, int numberOfReplicas,
            Collection<T> nodes) {
        this.hashFunction = hashFunction;
        this.numberOfReplicas = numberOfReplicas;

        for (T node : nodes) {
            add(node);
        }
    }

    public void add(T node) {
        for (int i = 0; i < numberOfReplicas; i++) {
            circle.put(hashFunction.hashString(node.toString() + i).asLong(),
                    node);
        }
    }

    public void remove(T node) {
        for (int i = 0; i < numberOfReplicas; i++) {
            circle.remove(hashFunction.hashString(node.toString() + i).asLong());
        }
    }

    public T get(Object key) {
        if (circle.isEmpty()) {
            return null;
        }
        long hash = hashFunction.hashString(key.toString()).asLong();
        if (!circle.containsKey(hash)) {
            SortedMap<Long, T> tailMap = circle.tailMap(hash);
            hash = tailMap.isEmpty() ? circle.firstKey() : tailMap.firstKey();
        }
        return circle.get(hash);
    }
}

Test code:

        ArrayList<String> al = new ArrayList<String>(); 
        al.add("redis1");
        al.add("redis2");
        al.add("redis3");
        al.add("redis4");

        String[] userIds = 
        {"-84942321036308",
        "-76029520310209",
        "-68343931116147",
        "-54921760962352"
        };
        HashFunction hf = Hashing.md5();

        ConsistentHash<String> consistentHash = new ConsistentHash<String>(hf, 100, al); 
        for (String userId : userIds) {
            System.out.println(consistentHash.get(userId));
        }

Python code:

import bisect
import md5

class ConsistentHashRing(object):
    """Implement a consistent hashing ring."""

    def __init__(self, replicas=100):
        """Create a new ConsistentHashRing.

        :param replicas: number of replicas.

        """
        self.replicas = replicas
        self._keys = []
        self._nodes = {}

    def _hash(self, key):
        """Given a string key, return a hash value."""

        return long(md5.md5(key).hexdigest(), 16)

    def _repl_iterator(self, nodename):
        """Given a node name, return an iterable of replica hashes."""

        return (self._hash("%s%s" % (nodename, i))
                for i in xrange(self.replicas))

    def __setitem__(self, nodename, node):
        """Add a node, given its name.

        The given nodename is hashed
        among the number of replicas.

        """
        for hash_ in self._repl_iterator(nodename):
            if hash_ in self._nodes:
                raise ValueError("Node name %r is "
                            "already present" % nodename)
            self._nodes[hash_] = node
            bisect.insort(self._keys, hash_)

    def __delitem__(self, nodename):
        """Remove a node, given its name."""

        for hash_ in self._repl_iterator(nodename):
            # will raise KeyError for nonexistent node name
            del self._nodes[hash_]
            index = bisect.bisect_left(self._keys, hash_)
            del self._keys[index]

    def __getitem__(self, key):
        """Return a node, given a key.

        The node replica with a hash value nearest
        but not less than that of the given
        name is returned.   If the hash of the
        given name is greater than the greatest
        hash, returns the lowest hashed node.

        """
        hash_ = self._hash(key)
        start = bisect.bisect(self._keys, hash_)
        if start == len(self._keys):
            start = 0
        return self._nodes[self._keys[start]]

Test code:

import ConsistentHashRing

if __name__ == '__main__':
    server_infos = ["redis1", "redis2", "redis3", "redis4"];
    hash_ring = ConsistentHashRing()
    test_keys = ["-84942321036308",
        "-76029520310209",
        "-68343931116147",
        "-54921760962352",
        "-53401599829545"
        ];

    for server in server_infos:
        hash_ring[server] = server

    for key in test_keys:
        print str(hash_ring[key])
superche
  • 753
  • 3
  • 9
  • 21
  • If you have code that you have already written and that you are having trouble with, please include the relevant parts in your question. We can't help you with your code if you don't show it. – Greg Hewgill Sep 11 '12 at 03:40
  • What hash functions have you considered? How are you calling them? Do standard cryptographic hash functions like SHA-1 not meet your needs? – Brendan Long Sep 11 '12 at 03:42
  • So where's your code implementing those hash functions and what didn't work about them? – Brendan Long Sep 11 '12 at 03:46
  • @BrendanLong in the examples I have tried, they use MD5. My requirement is it's the same consistent hashing implementation for both Java and Python. – superche Sep 11 '12 at 03:50
  • 1
    @superche are you saying that the outcome of hashing your keys with MD5 in Python and hashing your keys with MD5 in Java is different? If so, you should check that you really are hashing the same string/keys/whatever. And you should post the code that isn't working. – Jeff Tratner Sep 11 '12 at 04:00
  • 4
    I wonder if your data is encoded differently in Python vs Java, since they both have different ideas of what the default string encoding should be. – Brendan Long Sep 11 '12 at 04:01
  • @BrendanLong I add # coding: utf-8 to the Python code, should the encoding be same as Java? – superche Sep 11 '12 at 05:25
  • @superche what matters here might be data encoding, not code encoding – Kos Sep 11 '12 at 05:40
  • @Kos I have attached the code, data is plain text. – superche Sep 11 '12 at 06:42

7 Answers7

10

You seem to be running into two issues simultaneously: encoding issues and representation issues.

Encoding issues come about particularly since you appear to be using Python 2 - Python 2's str type is not at all like Java's String type, and is actually more like a Java array of byte. But Java's String.getBytes() isn't guaranteed to give you a byte array with the same contents as a Python str (they probably use compatible encodings, but aren't guaranteed to - even if this fix doesn't change things, it's a good idea in general to avoid problems in the future).

So, the way around this is to use a Python type that behaves like Java's String, and convert the corresponding objects from both languages to bytes specifying the same encoding. From the Python side, this means you want to use the unicode type, which is the default string literal type if you are using Python 3, or put this near the top of your .py file:

from __future__ import unicode_literals

If neither of those is an option, specify your string literals this way:

u'text'

The u at the front forces it to unicode. This can then be converted to bytes using its encode method, which takes (unsurprisingly) an encoding:

u'text'.encode('utf-8')

From the Java side, there is an overloaded version of String.getBytes that takes an encoding - but it takes it as a java.nio.Charset rather than a string - so, you'll want to do:

"text".getBytes(java.nio.charset.Charset.forName("UTF-8"))

These will give you equivalent sequences of bytes in both languages, so that the hashes have the same input and will give you the same answer.

The other issue you may have is representation, depending on which hash function you use. Python's hashlib (which is the preferred implementation of md5 and other cryptographic hashes since Python 2.5) is exactly compatible with Java's MessageDigest in this - they both give bytes, so their output should be equivalent.

Python's zlib.crc32 and Java's java.util.zip.CRC32, on the other hand, both give numeric results - but Java's is always an unsigned 64 bit number, while Python's (in Python 2) is a signed 32 bit number (in Python 3, its now an unsigned 32-bit number, so this problem goes away). To convert a signed result to an unsigned one, do: result & 0xffffffff, and the result should be comparable to the Java one.

Mark
  • 3,137
  • 4
  • 39
  • 76
lvc
  • 34,233
  • 10
  • 73
  • 98
3

According to this analysis of hash functions:

Murmur2, Meiyan, SBox, and CRC32 provide good performance for all kinds of keys. They can be recommended as general-purpose hashing functions on x86.

Hardware-accelerated CRC (labeled iSCSI CRC in the table) is the fastest hash function on the recent Core i5/i7 processors. However, the CRC32 instruction is not supported by AMD and earlier Intel processors.

Python has zlib.crc32 and Java has a CRC32 class. Since it's a standard algorithm, you should get the same result in both languages.

MurmurHash 3 is available in Google Guava (a very useful Java library) and in pyfasthash for Python.

Note that these aren't cryptographic hash functions, so they're fast but don't provide the same guarantees. If these hashes are important for security, use a cryptographic hash.

Community
  • 1
  • 1
Brendan Long
  • 53,280
  • 21
  • 146
  • 188
2

Differnt language implementations of a hashing algorithm does not make the hash value different. The SHA-1 hash whether generated in java or python will be the same.

basiljames
  • 4,777
  • 4
  • 24
  • 41
2

I'm not familiar with Redis, but the Python example appears to be hashing keys, so I'm assuming we're talking about some sort of HashMap implementation.

Your python example appears to be using MD5 hashes, which will be the same in both Java and Python.

Here is an example of MD5 hashing in Java:

http://www.dzone.com/snippets/get-md5-hash-few-lines-java

And in Python:

http://docs.python.org/library/md5.html

Now, you may want to find a faster hashing algorithm. MD5 is focused on cryptographic security, which isn't really needed in this case.

Chris Bode
  • 1,265
  • 7
  • 16
  • MD5 hash is the same for both Java and Python, but when try to implemnt consistent hashing for keys, it's different because of the different data types in Java and Python. – superche Sep 11 '12 at 04:06
  • What data types are you having a problem with? The simplest solution is probably to convert them to strings before hashing.. – Brendan Long Sep 11 '12 at 04:10
  • @superche I was assuming text keys, which may not be the case here. If the keys aren't text, he's going to need to convert them to some common format in any solution. In his sample code, the keys seem to be strings, so MD5 should function just fine. – Chris Bode Sep 11 '12 at 04:41
  • I didn't see the MD5 hashing in Java and Python to be the same. I tested with text keys, "-84942321036308", I get (195940655746834055544853328800610818493) for Python and (8357636395451350515) for Java. Can you show me your code? – superche Sep 11 '12 at 07:01
  • 1
    The MD5 of "-84942321036308" is `9368cc30fe241dcba2beb130bfe499bd`, or `195940655746834055544853328800610818493` in decimal. What you claim java gave you isn't even a MD5 hash, which are all 128-bit. Of course, java doesn't have a 128-bit numeric data type outside of BigInteger, which you really shouldn't be using for hashes. AS I said, you may want to find a faster hashing algorithm. MD5 is designed to not be reversible. There are simpler ways to do this. Let me update my answer. – Chris Bode Sep 11 '12 at 13:37
2

Here is a simple hashing function that produces the same result on both python and java for your keys:

Python

def hash(key):
        h = 0
        for c in key:
                h = ((h*37) + ord(c)) & 0xFFFFFFFF
        return h;

Java

public static int hash(String key) {
    int h = 0;
    for (char c : key.toCharArray())
        h = (h * 37 + c) & 0xFFFFFFFF;
    return h;
}

You don't need a cryptographically secure hash for this. That's just overkill.

Chris Bode
  • 1,265
  • 7
  • 16
  • I spent a really long time thinking that you had used the standard JDK `String.hashCode()` function, but it turns out you used 37 as your base, while the JDK uses 31. – traviscj Feb 26 '15 at 22:43
1

Let's get this straight: the same binary input to the same hash function (SHA-1, MD5, ...) in different environments/implementations (Python, Java, ...) will yield the same binary output. That's because these hash functions are implemented according to standards.

Hence, you will discover the sources of the problem(s) you experience when answering these questions:

  • do you provide the same binary input to both hash functions (e.g. MD5 in Python and Java)?

  • do you interpret the binary output of both hash functions (e.g. MD5 in Python and Java) equivalently?

@lvc's answer provides much more detail on these questions.

Dr. Jan-Philip Gehrcke
  • 33,287
  • 14
  • 85
  • 130
0

For the java version, I would recommend using MD5 which generates 128bit string result and it can then be converted into BigInteger (Integer and Long are not enough to hold 128bit data).

Sample code here:

private static class HashFunc {

    static MessageDigest md5;

    static {
        try {
            md5 = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            //
        }
    }

    public synchronized int hash(String s) {
        md5.update(StandardCharsets.UTF_8.encode(s));
        return new BigInteger(1, md5.digest()).intValue();
    }
}

Note that:

The java.math.BigInteger.intValue() converts this BigInteger to an int. This conversion is analogous to a narrowing primitive conversion from long to int. If this BigInteger is too big to fit in an int, only the low-order 32 bits are returned. This conversion can lose information about the overall magnitude of the BigInteger value as well as return a result with the opposite sign.

Q i
  • 320
  • 5
  • 12