2

Let's say we have a collection of byte strings, sorted in lexicographic order, as usual. We want to define a hash function, mapping a string to an integer, in such a way that ordering of hash values preserves the ordering of the strings to a sufficient degree. That is, given string A being lesser or equal to string B, H(A) should always yield a value, which is lesser or equal to H(B).

Clearly, a not so good hash function of this sort is possible. For example, we can take a fixed prefix of each string (say, 8 bytes) and pretend it to be a big-endian unsigned int64. The resulting integers will be sorted in a desirable order. This approach even works for shorter strings: we can append some 0s to a short string to make it at least prefix bytes long (but only if we can assume that 0 is not a valid byte value).

Unfortunately, this potential solution, while fast and easy, has major drawbacks. It becomes rather useless in cases where strings tend to feature sizeable common prefixes. It can not handle strings shorter than chosen prefix when '0x00' is a meaningful byte and we want to sort shorter strings before longer ones.

So the question is whether it is possible to do any better? Some arithmetic (or rather Knuth's "Concrete Mathematics" sort of) trick which can consider all the bytes of the string and yield an appropriately ordered hash value?

Kelly Bundy
  • 23,480
  • 7
  • 29
  • 65
oakad
  • 6,945
  • 1
  • 22
  • 31
  • What your describing seems to have little in common with what most people refer to as "hash" functions. What "hashy" properties of this function do you want? – President James K. Polk Mar 17 '22 at 16:49
  • 1
    Where did this problem come from? A very similar question was asked yesterday by someone who said they 'thought up the problem themselves': [How to sort non-numeric strings by converting them to integers? Is there a way to convert strings to unique integers while being ordered?](https://stackoverflow.com/questions/71491613/how-to-sort-non-numeric-strings-by-converting-them-to-integers-is-there-a-way-t) – kcsquared Mar 17 '22 at 16:56
  • First sort the strings in order. Assign each string a hash value equivalent to its position in the sorted array: 0, 1, 2, 3, ... That exactly preserves the sorted order in the hash function. If you might want to add new strings, then number the initial strings: 0, 10, 20, 30, ... to allow for later insertions. You may need to re-hash the entire array if there are too many insertions in one part. – rossum Mar 17 '22 at 16:57
  • @kcsquared I missed that other question. I agree it looks similar. In my case the problem comes from key partitioning approaches in a particular database. – oakad Mar 18 '22 at 04:26
  • @rossum Nope, we don't have all the strings in advance. – oakad Mar 18 '22 at 04:26

1 Answers1

4

The best you can do is to apply an order-preserving arithmetic encoding, based on the best statistical model of the strings that you can come up with, and then take a prefix of that to form the "hash" code.

Each hash code will then be equally likely, according to that statistical model.

If your model is just that all strings are equally likely, then this reduces to your "just take a prefix idea"... so whether or not this will work for you really depends on how much you know about your strings and how good you need this code to be.

Also note that many realistic models will also allow a simpler encoding scheme. "just take a prefix" is again an example of this.

Most things that people might think they want to do with a "hash code" like this are not practical -- you will probably end up doing something else. Maybe you want to ask about your real problem, so we can help solve it in some other way.

Matt Timmermans
  • 53,709
  • 3
  • 46
  • 87
  • The problem at hand is partitioning. Given so many strings, I want to split them into partitions in such a way, that numeric partition ids follow the general order of the entire collection. – oakad Mar 18 '22 at 04:31
  • I can see people have thought about similar problems before (for example, http://rsrikant.com/papers/sigmod04.pdf). Suppose it's something I should investigate in depth. – oakad Mar 18 '22 at 04:40
  • Why do you need the order-preserving property? – Matt Timmermans Mar 18 '22 at 14:33