
In an effort to accelerate fast-out behaviour when testing strings for anagrams, I came up with a prime-based hashing scheme -- although it looks like I wasn't the first.

The basic idea is to map letters to prime numbers, and to compute the product of these primes. Any rearrangement of the letters will have the same product, and -- because prime factorisation is unique -- so long as the product is allowed to grow arbitrarily large (ie., it doesn't overflow) no combination of other letters can produce the same result.

I had initially envisioned this as just a hash. Eventually the product would overflow and start to alias other letter combinations. However, by mapping the most frequent letters to the smallest primes the product grows slowly and can often avoid overflow altogether. In this case we get a perfect hash, giving both definite positive and negative results without additional testing.
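
Concretely, a minimal sketch of the sort of thing I mean (the prime table here is just one plausible frequency ordering; the exact assignment isn't important):

```c
#include <ctype.h>
#include <stdint.h>

/* Letters mapped to the first 26 primes in rough order of English letter
 * frequency (ETAOIN...), so the most common letters get the smallest
 * primes.  The table is indexed alphabetically (a..z). */
static const uint8_t prime_of[26] = {
    /*  a   b   c   d  e   f   g   h   i   j   k   l   m  */
        5, 71, 37, 29, 2, 53, 59, 19, 11, 83, 79, 31, 43,
    /*  n  o   p   q   r   s  t   u   v   w   x   y    z  */
       13, 7, 67, 97, 23, 17, 3, 41, 73, 47, 89, 61, 101
};

/* Product-of-primes digest.  If the product would exceed 64 bits the
 * overflow flag is set and the result degrades from a perfect code to
 * an ordinary hash (aliasing becomes possible). */
uint64_t anagram_product(const char *word, int *overflowed)
{
    uint64_t product = 1;
    *overflowed = 0;
    for (; *word; word++) {
        if (!isalpha((unsigned char)*word))
            continue;
        uint64_t p = prime_of[tolower((unsigned char)*word) - 'a'];
        if (product > UINT64_MAX / p)
            *overflowed = 1;
        product *= p;               /* wraps modulo 2^64 on overflow */
    }
    return product;
}
```

Since no factor exceeds 103, any word of nine letters or fewer fits in 64 bits regardless of which letters it uses, and the frequency ordering stretches that considerably for typical words.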

What's notable is that it doesn't fill the coding space very efficiently before overflowing. No result will have any prime factors greater than 103, and the distribution of small primes is fixed and not necessarily a great match to letter frequency.

Now I'm wondering if there's something substantially better than this. Something that covers more results with perfect hashes and has strong distribution in the remaining cases.

The densest coding scheme I can think of is to sort the letters and then pack them into a word with an entropy coder. In this scheme the letter frequency at each position will obviously be enormously biased because of the range constraints that sorting applies to each position (eg., the likelihood of a sorted array starting with a z is substantially lower than that of a sorted array ending with a z).

That sounds like a whole lot of work, though -- and I can't see it guaranteeing to give good distribution in the overflow case.
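
The sorted form itself is trivial to produce -- something like the sketch below (names illustrative); it's packing it into a fixed-size word with an entropy coder that would be the real work:

```c
#include <stdlib.h>
#include <string.h>

/* Canonical sorted-letters form: two strings are anagrams exactly when
 * their sorted copies compare equal.  An entropy coder would then pack
 * this (heavily biased) sorted sequence into a fixed-size word. */
static int cmp_char(const void *a, const void *b)
{
    return *(const unsigned char *)a - *(const unsigned char *)b;
}

void sort_letters(char *word)   /* sorts the string in place */
{
    qsort(word, strlen(word), 1, cmp_char);
}
```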

Perhaps there's a better set of factors to map the letters to, and a better way to detect when aliasing has become a risk. Or a hashing scheme that doesn't rely on multiplication? Something that's easy to calculate?

So that's:

  • A perfect hash for as much real-world input as possible (for some sensible number of bits).
  • A strong hash for remaining cases, with a means of distinguishing the two cases.
  • Easy to calculate.

English language constraints (26 letters with typical English-like word structure) will do fine. Multi-byte coding schemes are a whole other problem.

C code preferred because I understand it.

sh1
  • Why do you need the hash to be perfect? If the hash maps anagrams to the same product of primes, then it also maps them to that same product modulo some arbitrary modulus. Just decide how big you want your hashtable to be, take the product mod that size, and deal with the occasional collision as you would with any other hash table algorithm. – Lee Daniel Crocker Aug 10 '13 at 13:39
  • @LeeDanielCrocker, in the cases where it's perfect (and it knows it), then in the case where you see that two strings have the same hash, you know immediately that they _are_ anagrams. If the hash is not perfect then you know only that you have to do further work to be sure. I wouldn't normally hope to exploit that, but it turns out that a 32-bit digest (including one bit marker) already captures about 50% of dictionary words perfectly. So it's probably possible to make 20-30 bits perfect for practically the whole language. – sh1 Aug 10 '13 at 13:51
  • The problem I see with this is that you get very large numbers very quickly. If we start with 3 * 5 * 7 * 11 * 13 * 17 * 19 * 23 * 29 * 31, the number is bigger than 32 bits. If we take the word (from the original "adventure" game) "xyzzy", we get 97 * 101 * 103 * 103 * 101, which is already out of the range of a 32 bit number too. So you will need large integers to deal with the numbers if you want to deal with a little longer anagrams... – Mats Petersson Aug 10 '13 at 13:55
  • I just wrote a bit of code to test this out. The word pair of "pterygoplichtys" and "glyptoperichthys" (scientific names for two genera of suckermouth catfish) does indeed match up, but I also get two unsigned 64-bit overflows from that, and "Supercalifragilisticexpialidocious" gives overflow several times over. I'm not convinced that the overflow is "safe" (in other words, you may find that there are false matches) if you run enough long words through it. – Mats Petersson Aug 10 '13 at 14:09
  • @sh1, regarding your thought about creating a denser coding scheme by sorting the letters. Since the letters are sorted, maybe encoding the RLE will work better? – Vadim Aug 10 '13 at 17:06
  • @MatsPetersson, You certainly get some false matches somewhere -- although I'm having a hard time finding them. You can only declare a definite match if the hash doesn't overflow (also maybe if you know some advanced mathematics), but non-overflowing hashes with the stated algorithm apply to most words less than six or seven letters long (longer than average in English). In the remaining cases the hash merely eliminates most of the junk before leaving you to do the hard work to compare accurately. – sh1 Aug 10 '13 at 17:41
  • Absolutely. Not sure multiplication is that great a method for creating a hash, but I guess it beats having to compare a huge number of words. – Mats Petersson Aug 10 '13 at 17:46
  • Quick check with my "anagram compare benchmark", for a single string vs. another single string, it's about half the speed of "count up/down for each letter in the string". But if you have a long list of words, and want to check which words are actually anagrams of some other word in the same or another list, then it seems plausible. – Mats Petersson Aug 10 '13 at 17:53
  • @MatsPetersson, I was working from the position of pre-computed hashes being the most commonly used mechanism for eliminating collisions. If you do store words in a hash table, it won't be a 64-bit table, so it's worth keeping the full-sized, pre-computed hash as part of the object to cull the collisions. – sh1 Aug 11 '13 at 11:18
  • Indeed, store the full hash, and if it compares as equal, compare the string. If the hash is not a match, then it's not an anagram. – Mats Petersson Aug 11 '13 at 11:20
  • @Vadim, it looks like flancor's solution does, in effect, use RLE as you suggest. – sh1 Aug 13 '13 at 11:14

2 Answers


If you are using n-bit hashes with an alphabet of size m, you can get a unique hash for anagrams up to (n-m) characters long using the approach I described here. This makes collision detection unnecessary, but it does limit your maximum word length depending on the size of the alphabet and your available space.

To allow words of any length, I would use n-1 bits to do that hash for words up to (n-m-1) characters in length, and save the last bit to signal that the word is n-m characters or longer. In those cases you would use the remaining n-1 bits for your prime-number or other hashing algorithm, but of course you would have to do collision detection any time you got multiple words in those buckets. Since in a real-world application the majority of the words will occupy the shorter word lengths, you'll drastically cut the collision detection needed for the longer words.
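
A minimal C sketch of one such encoding -- here assuming the approach in question is the unary-coded letter histogram discussed in the comments below; the name and details are illustrative:

```c
#include <ctype.h>
#include <stdint.h>

/* Unary-coded histogram: for each letter of the alphabet, append
 * `count` one-bits followed by a single zero-bit separator.  The code
 * uses (word length + 26) bits, so a 64-bit word is collision-free for
 * any word of up to 64 - 26 = 38 letters; beyond that the high bits
 * fall off and it degrades to an ordinary hash. */
uint64_t histogram_hash(const char *word)
{
    int counts[26] = { 0 };
    for (; *word; word++)
        if (isalpha((unsigned char)*word))
            counts[tolower((unsigned char)*word) - 'a']++;

    uint64_t hash = 0;
    for (int i = 0; i < 26; i++) {
        hash <<= counts[i] + 1;           /* make room: counts[i] ones plus a zero separator */
        hash |= (1ull << counts[i]) - 1;  /* set the counts[i] one-bits */
    }
    return hash;
}
```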

flancor
  • What you have here is the histogram coded with [variable-length codes](http://en.wikipedia.org/wiki/Variable-length_code), specifically [unary coding](http://en.wikipedia.org/wiki/Unary_coding). So that actually is an entropy encoding of the sorted list, except that you've implicitly encoded even absent letters but with very short codes. In fact, the zeroes represent _another_ unary encoding of the _difference_ between the letters that are present in the word. – sh1 Aug 11 '13 at 19:29
  • If you have letter-frequency statistics to hand, you can optimise this encoding by moving the least-frequently-used letters to the end. Even if you run out of space before the end of the word, you don't lose any information provided the remaining letters were not present (you just infer all the zeroes you need to complete the code). Further optimisations involve tweaking the letter-count coding scheme and intra-letter coding scheme to be optimal variable-length codes (I believe they are already close), and using an arithmetic rather than binary code (hard work!). – sh1 Aug 11 '13 at 19:43
  • @sh1 Optimizing my algorithm using letter frequency in the way you describe will generate collisions, which would mean you lose the only benefit of the algorithm: no collision checking. Using letter frequency without generating collisions shouldn't create any tangible benefits (it depends on how you code the hash function). Such optimizations are great for the second hashing of words with length >= m, because you have to check for collisions there anyway. – flancor Aug 11 '13 at 22:37
  • @sh1 I challenge you to optimize my coding scheme as presented in the original answer :D As I pointed out, there is a one-to-one correspondence of possible hashes to possible anagrams, so it's impossible to guarantee uniqueness with a smaller hash. Now, if you make assumptions about the data (for example, no English word has 4 or more of the same letter in a row) you might be able to prune a couple of bits, but it's dangerous to do this with no backup plan (collision detection). Could you clarify what you mean about an arithmetic rather than a binary code? Thanks! – flancor Aug 11 '13 at 22:42
  • I mean to sort by global letter frequency, not letter frequency in the one word. Assuming 'e' is the most common letter, you _always_ emit that first instead of 'a', etc., even if the specific word doesn't use 'e' itself. By the time you get near the end you've probably already mentioned every letter and the rest of the code is zero and you can cut it short. If you don't use 'z' then you have room for a word one or two letters longer, and if you don't use 'q' either, that's another two letters, etc.. With 'y' near the end you have a strong chance of missing that opportunity. – sh1 Aug 11 '13 at 23:07
  • Will get to the other bit later, but that principle of arranging the data so that the later parts are _likely_ to be zero is used in image compression. Taking a 16x16 piece of the image, apply a [DCT](http://en.wikipedia.org/wiki/Discrete_cosine_transform) and [zig-zag ordering](http://en.wikipedia.org/wiki/JPEG#Entropy_coding), and then the last values are likely to be very small or zero so you cut them off. – sh1 Aug 11 '13 at 23:15
  • You're absolutely right. I didn't consider the possibility of having a variable cutoff length between the unique hashes and the "second pass." Although with this improvement you now have to attempt the first hash to see if the second is necessary, rather than just looking at the word length. – flancor Aug 11 '13 at 23:36
  • Right, so the process to optimise this technique to capture more words is to view it as a compression scheme. Specifically an entropy coder. The letter counts for the whole alphabet are coded as variable-length codes with the shortest code (1 bit) being assigned to a count of 0 (most likely for normal words). A potential for optimisation comes from encoding runs of several zeroes more efficiently than using a whole bit for each zero. Eg., for an 8-letter word (and maybe more), two zeroes in a row is more likely than one, so we could use a shorter code for two zeroes than for one zero. – sh1 Aug 13 '13 at 10:51
  • The whole thing can be considered alternating codes for 'next letter' and 'number of letters'; but rather than using standard unary code, you can re-balance the code lengths using all of the available insights about letter frequency and distance -- you would use a [Huffman code](http://en.wikipedia.org/wiki/Huffman_coding) for this. [Arithmetic coding](http://en.wikipedia.org/wiki/Arithmetic_coding) extends on this principle, but rather than using bit-shifts it uses multiplication, which makes it more flexible, and doesn't have to round code lengths to powers of two. – sh1 Aug 13 '13 at 10:58
  • If the 'compressed' code turns out too long to fit in the hash size then we have a collision risk, but we might as well persist with the same coding scheme by pressing on, and just folding the code back over on itself using some kind of generic hashing algorithm, like murmur hash, to keep it within the required size, and adding a marker bit to indicate that it's imperfect. – sh1 Aug 13 '13 at 11:09
  • Sorry, huffman rounds code lengths to integer numbers of bits, not powers of two, whereas arithmetic allows fractional code lengths. – sh1 Aug 13 '13 at 11:13

Implemented flancor's answer here: https://github.com/sh1boot/anagram/blob/master/bitgram.c

Using http://ftp.debian.org/debian/pool/main/s/scowl/wbritish_7.1-1_all.deb as the dictionary, I get the following proportions of perfect codes (both positive and negative matches from the hash comparison) using these coding schemes:

order  | frequency|  alphabet| frequency|  alphabet| frequency
code   |    prime |    unary |    unary |  huffman |  huffman
-------|----------|----------|----------|----------|----------
64-bit |   93.95% |  100.00% |  100.00% |  100.00% |  100.00%
32-bit |   24.60% |   69.14% |   74.23% |   86.57% |   90.58%
28-bit |   15.20% |   41.75% |   60.27% |   68.16% |   74.34%
24-bit |    8.21% |   13.14% |   38.55% |   42.67% |   50.34%
20-bit |    3.72% |    3.35% |   16.73% |   19.41% |   25.59%
16-bit |    1.31% |    0.82% |    4.86% |    5.51% |    9.18%

This shows that any variant of the bit-packed histogram encoding can capture the whole dictionary in perfect 64-bit hashes, and more than two thirds of the dictionary in 32-bit hashes; whereas the prime products scheme struggles to reach 100% even in 64-bit, and can't even cover a quarter of the dictionary in a 32-bit hash.

This list does contain a lot of apostrophes, though, and almost all (but not all) of the 64-bit misses include apostrophes. Putting that character in its proper place in the frequency table might help.

Some additional tests:

frequency/huffman, overflowed:

88aff7eb53ef9940: Pneumonoultramicroscopicsilicovolcanoconiosis

frequency/huffman, 64-bit perfect:

17415fbc30c2ebf7: Chargoggagoggmanchauggagoggchaubunagungamaugg
018a6b5dda873fba: Supercalifragilisticexpialidocious
00021ae1bcf50dba: Pseudopseudohypoparathyroidism
00070003dd83b5fb: Floccinaucinihilipilification
00002b800ebbdf7a: Antidisestablishmentarianism
000006c6ab75b3f9: Honorificabilitudinitatibus
0000000421b2ad94: glyptoperichthys
0000000201b2ad94: pterygoplichtys

It's probably the case that the tuning should be done against words that are likely to fall near the boundary, as they may have properties (eg., language of origin) which strongly influence their letter distribution; making shorter words' codes longer doesn't matter so long as they still don't overflow.

sh1