Assuming that most word pairs are not anagrams, the case you most need to optimise is the failure case.
Obviously, if the lengths are different then the strings cannot be anagrams; and since the length is computed once, when the string is created, it's a very cheap property to compare as a fast-out.
If you change your string class to include more of these properties, you can improve the accuracy of the fast-out. Rather than beginning the test function with:
if (s1.length() != s2.length()) return false;
You could use:
if (s1.hash != s2.hash) return false;
and when you create the string (or after you modify it), you compute a value for hash which is unique (or nearly unique) to all strings with that set of letters, in any order. You only compute this hash when a string changes, not for every comparison that you do.
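For illustration, here's a minimal sketch of where that cached value might live. This wrapper class is my own assumption (nothing like it appears in the question), and compute_hash() stands in for whichever of the two schemes below you choose:

#include <string>

unsigned long compute_hash(const std::string &text);   // either of the schemes below

struct HashedString
{
    std::string text;
    unsigned long hash;   // cached; recomputed only when the text changes

    explicit HashedString(const std::string &t) : text(t), hash(compute_hash(t)) {}

    void assign(const std::string &t)
    {
        text = t;
        hash = compute_hash(t);   // comparisons never pay for hashing
    }

    size_t length() const { return text.length(); }
    char operator[](size_t i) const { return text[i]; }
};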
A very simple way to compute the hash is:
long sum = 0;
for (int i = 0; i < s.length(); i++)
    sum += s[i];   // addition commutes, so letter order doesn't matter
s.hash = sum;
This is quick to compute, but it is prone to collisions: for example, "ad" and "bc" collide, because 97 + 100 == 98 + 99.
A more advanced method uses a product of primes:
static const unsigned int primetable[256] = { 1, 3, 5, 7, /* ... */ };

unsigned long prod = 1;
for (int i = 0; i < s.length(); i++)
    prod *= primetable[(unsigned char)s[i]];   // cast, because plain char may be signed
s.hash = prod;
Note that two is omitted from the table of primes. Since every factor is odd, prod is always odd, and therefore co-prime with the modulus of unsigned long arithmetic (a power of two). Because multiplying by an odd number stays invertible modulo a power of two, this maximises the distribution of results in the face of numerical overflow.
If the table is arranged to put small primes at frequent letter positions, then the cases where overflow occurs (which can lead to hash collisions) can be minimised. If there's no overflow then you have a perfect hash (for these purposes), and you have absolute certainty both ways: you can return either true or false immediately, just by comparing hash.
Consequently, computing the hash like this works out much better:
static const unsigned int primetable[256] = {   // the smallest primes sit at the most frequent (lower-case) letters
1, 1429, 1433, 1439, 1447, 1451, 1453, 1459, 1471, 1481, 1483, 1487, 1489, 1493, 1499, 1511, 1523, 1531, 1543, 1549, 1553, 1559, 1567, 1571, 1579, 1583, 1597, 1601, 1607, 1609, 1613, 1619,
821, 823, 827, 829, 839, 853, 857, 107, 859, 863, 877, 881, 883, 109, 887, 907, 113, 127, 131, 137, 139, 149, 151, 157, 163, 167, 911, 919, 929, 937, 941, 947,
601, 607, 613, 617, 619, 631, 641, 643, 647, 653, 659, 661, 673, 677, 683, 691, 701, 709, 719, 727, 733, 739, 743, 751, 757, 761, 769, 773, 787, 797, 809, 811,
359, 11, 61, 31, 37, 3, 71, 43, 59, 7, 101, 79, 29, 53, 13, 23, 47, 103, 17, 5, 19, 41, 73, 83, 89, 67, 97, 367, 373, 379, 383, 389,
1181, 1187, 1193, 1201, 1213, 1217, 1223, 1229, 1231, 1237, 1249, 1259, 1277, 1279, 1283, 1289, 1291, 1297, 1301, 1303, 1307, 1319, 1321, 1327, 1361, 1367, 1373, 1381, 1399, 1409, 1423, 1427,
953, 967, 971, 977, 983, 991, 997, 1009, 1013, 1019, 1021, 1031, 1033, 1039, 1049, 1051, 1061, 1063, 1069, 1087, 1091, 1093, 1097, 1103, 1109, 1117, 1123, 1129, 1151, 1153, 1163, 1171,
397, 401, 409, 419, 421, 431, 433, 439, 443, 449, 457, 461, 463, 467, 479, 487, 491, 499, 503, 509, 521, 523, 541, 547, 557, 563, 569, 571, 577, 587, 593, 599,
173, 179, 181, 191, 193, 197, 199, 211, 223, 227, 229, 233, 239, 241, 251, 257, 263, 269, 271, 277, 281, 283, 293, 307, 311, 313, 317, 331, 337, 347, 349, 353
};
unsigned long prod = 1;
bool overflow = false;   // set if prod ever wraps past ULONG_MAX
for (int i = 0; i < s.length(); i++)
{
    unsigned int p = primetable[(unsigned char)s[i]];
    overflow |= (prod > ULONG_MAX / p);   // would the next multiply wrap? (ULONG_MAX is in <climits>)
    prod *= p;
}
if (overflow)
    prod ^= 1;   // clear the low bit: an even hash marks an inexact (overflowed) product
s.hash = prod;
with fast-outs:
if (s1.hash != s2.hash) return false;           // different hashes: definitely not anagrams
if ((s1.hash & 1) != 0) return true;            // odd hash: no overflow occurred, so the match is exact
if (s1.length() != s2.length()) return false;   // even hash is inexact, so keep filtering
The middle line is only safe to use if the character encoding is not multi-byte. If you are working with a multi-byte encoding then the hash will still eliminate most non-anagrams, but it will produce false positives, because byte-level reorderings that the hash treats as equivalent are not all legitimate character-level reorderings.
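Putting the pieces together, here's a minimal sketch of the whole comparison for single-byte text, using the hypothetical HashedString above. The counting fallback for the overflowed case is my own addition; it isn't shown in the original:

#include <climits>

bool is_anagram(const HashedString &s1, const HashedString &s2)
{
    if (s1.hash != s2.hash) return false;   // different hashes: definitely not anagrams
    if ((s1.hash & 1) != 0) return true;    // odd hash: the product didn't overflow, so it's conclusive
    if (s1.length() != s2.length()) return false;

    // Both hashes overflowed to the same (inexact) value, so do an exact letter count.
    int counts[256] = { 0 };
    for (size_t i = 0; i < s1.length(); i++) counts[(unsigned char)s1[i]]++;
    for (size_t i = 0; i < s2.length(); i++) counts[(unsigned char)s2[i]]--;
    for (int i = 0; i < 256; i++)
        if (counts[i] != 0) return false;
    return true;
}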
Hacking Mats Petersson's test code to read from a dictionary, and trying this and the other algorithms on real English dictionary input, we see roughly a factor of four improvement over the next best algorithm (it was a factor of ten, but I tweaked the other code):
Function is_anagram  time: 218.9s hits: 93
Function is_anagram1 time: 200s   hits: 93
Function is_anagram2 time: 40s    hits: 93
Function is_anagram3 time: 7.74s  hits: 93
Function is_anagram4 time: 2.65s  hits: 93
Function is_anagram5 time: 2.3s   hits: 166
Function is_anagram6 time: 0.56s  hits: 166   <- This method
(the number of hits is different because the last two algorithms are both case-insensitive and my dictionary probably included acronyms colliding with natural words)
Update: although it's not what was asked, it was negligent of me not to point this out. I don't know if I didn't spot it, got sick of typing, or didn't want to make assumptions about the calling code...
Once you've hashed all the words, a trivial way to minimise the number of tests is to sort the list by that hash. Then you can limit comparisons to the parts of the list that are likely matches (according to the hash). This also makes branch prediction more efficient.
I just tried changing the N^2 iteration I tested with originally (I'm sure that was deliberately inefficient) to iterate over neighbours in a sorted list. The sort() call dominated the timing, but the whole pass was 200x faster than the fastest N^2 test, and the choice of comparison algorithm no longer made any meaningful difference to performance.
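For concreteness, a sketch of that sorted-neighbours pass; HashedString and is_anagram are the hypothetical pieces from above, and report() is a placeholder for whatever the calling code does with a match:

#include <algorithm>
#include <vector>

void report(const HashedString &a, const HashedString &b);   // placeholder

void find_all_anagrams(std::vector<HashedString> &words)
{
    // Sort by hash so that candidate anagrams end up adjacent.
    std::sort(words.begin(), words.end(),
              [](const HashedString &a, const HashedString &b) { return a.hash < b.hash; });

    // Only compare entries within a run of equal hashes.
    for (size_t i = 0; i < words.size(); i++)
        for (size_t j = i + 1; j < words.size() && words[j].hash == words[i].hash; j++)
            if (is_anagram(words[i], words[j]))
                report(words[i], words[j]);
}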
Or you could just bin words by hash as you receive them.
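A sketch of that binning variant too, again with the hypothetical pieces from above, keyed on the hash with a standard container:

#include <unordered_map>
#include <vector>

std::unordered_map<unsigned long, std::vector<HashedString>> bins;

void add_word(const HashedString &w)
{
    std::vector<HashedString> &bin = bins[w.hash];   // every earlier word with the same hash
    for (size_t i = 0; i < bin.size(); i++)
        if (is_anagram(bin[i], w))
            report(bin[i], w);   // same placeholder as before
    bin.push_back(w);
}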