Assuming that most word pairs are not anagrams, the case you most need to optimise is the failure case.
Obviously, if the lengths are different then the strings cannot be anagrams; and since the length is computed once, when the string is created, it's a very cheap property to compare as a fast-out.
If you change your string class to include more of these properties, you can improve the accuracy of the fast-out. Rather than beginning the test function with:
if (s1.length() != s2.length()) return false;
You could use:
if (s1.hash != s2.hash) return false;
and when you create the string (or after you modify it), you compute a value for hash which is unique (or nearly unique) to all strings with that set of letters, in any order. You only compute this hash when a string changes, not for every comparison that you do.
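For illustration, here's a minimal sketch of where that cached value might live. This wrapper class is my own assumption (nothing like it appears in the question), and compute_hash() stands in for whichever of the two schemes below you choose:

#include <string>

unsigned long compute_hash(const std::string &text);   // either of the schemes below

struct HashedString
{
    std::string text;
    unsigned long hash;   // cached; recomputed only when the text changes

    explicit HashedString(const std::string &t) : text(t), hash(compute_hash(t)) {}

    void assign(const std::string &t)
    {
        text = t;
        hash = compute_hash(t);   // comparisons never pay for hashing
    }

    size_t length() const { return text.length(); }
    char operator[](size_t i) const { return text[i]; }
};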
A very simple way to compute the hash is:
long sum = 0;
for (int i = 0; i < s.length(); i++)
    sum += s[i];   // addition commutes, so letter order doesn't matter
s.hash = sum;
This is quick to compute, but it is prone to collisions: for example, "ad" and "bc" collide, because 97 + 100 == 98 + 99.
A more advanced method uses a product of primes:
static const unsigned int primetable[256] = { 1, 3, 5, 7, /* ... */ };

unsigned long prod = 1;
for (int i = 0; i < s.length(); i++)
    prod *= primetable[(unsigned char)s[i]];   // cast, because plain char may be signed
s.hash = prod;
Note that two is omitted from the table of primes. Since every factor is odd, prod is always odd, and therefore co-prime with the modulus of unsigned long arithmetic (a power of two). Because multiplying by an odd number stays invertible modulo a power of two, this maximises the distribution of results in the face of numerical overflow.
If the table is arranged to put small primes at frequent letter positions, then the cases where overflow occurs (which can lead to hash collisions) can be minimised. If there's no overflow then you have a perfect hash (for these purposes), and you have absolute certainty both ways: you can return either true or false immediately, just by comparing hash.
Consequently, computing the hash like this works out much better:
static const unsigned int primetable[256] = {   // the smallest primes sit at the most frequent (lower-case) letters
1, 1429, 1433, 1439, 1447, 1451, 1453, 1459, 1471, 1481, 1483, 1487, 1489, 1493, 1499, 1511, 1523, 1531, 1543, 1549, 1553, 1559, 1567, 1571, 1579, 1583, 1597, 1601, 1607, 1609, 1613, 1619,
821, 823, 827, 829, 839, 853, 857, 107, 859, 863, 877, 881, 883, 109, 887, 907, 113, 127, 131, 137, 139, 149, 151, 157, 163, 167, 911, 919, 929, 937, 941, 947,
601, 607, 613, 617, 619, 631, 641, 643, 647, 653, 659, 661, 673, 677, 683, 691, 701, 709, 719, 727, 733, 739, 743, 751, 757, 761, 769, 773, 787, 797, 809, 811,
359, 11, 61, 31, 37, 3, 71, 43, 59, 7, 101, 79, 29, 53, 13, 23, 47, 103, 17, 5, 19, 41, 73, 83, 89, 67, 97, 367, 373, 379, 383, 389,
1181, 1187, 1193, 1201, 1213, 1217, 1223, 1229, 1231, 1237, 1249, 1259, 1277, 1279, 1283, 1289, 1291, 1297, 1301, 1303, 1307, 1319, 1321, 1327, 1361, 1367, 1373, 1381, 1399, 1409, 1423, 1427,
953, 967, 971, 977, 983, 991, 997, 1009, 1013, 1019, 1021, 1031, 1033, 1039, 1049, 1051, 1061, 1063, 1069, 1087, 1091, 1093, 1097, 1103, 1109, 1117, 1123, 1129, 1151, 1153, 1163, 1171,
397, 401, 409, 419, 421, 431, 433, 439, 443, 449, 457, 461, 463, 467, 479, 487, 491, 499, 503, 509, 521, 523, 541, 547, 557, 563, 569, 571, 577, 587, 593, 599,
173, 179, 181, 191, 193, 197, 199, 211, 223, 227, 229, 233, 239, 241, 251, 257, 263, 269, 271, 277, 281, 283, 293, 307, 311, 313, 317, 331, 337, 347, 349, 353
};
unsigned long prod = 1;
bool overflow = false;   // set if prod ever wraps past ULONG_MAX
for (int i = 0; i < s.length(); i++)
{
    unsigned int p = primetable[(unsigned char)s[i]];
    overflow |= (prod > ULONG_MAX / p);   // would the next multiply wrap? (ULONG_MAX is in <climits>)
    prod *= p;
}
if (overflow)
    prod ^= 1;   // clear the low bit: an even hash marks an inexact (overflowed) product
s.hash = prod;
with fast-outs:
if (s1.hash != s2.hash) return false;           // different hashes: definitely not anagrams
if ((s1.hash & 1) != 0) return true;            // odd hash: no overflow occurred, so the match is exact
if (s1.length() != s2.length()) return false;   // even hash is inexact, so keep filtering
The middle line is only safe to use if the character encoding is not multi-byte. If you are working with a multi-byte encoding then the hash will still eliminate most non-anagrams, but it will produce false positives, because byte-level reorderings that the hash treats as equivalent are not all legitimate character-level reorderings.
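Putting the pieces together, here's a minimal sketch of the whole comparison for single-byte text, using the hypothetical HashedString above. The counting fallback for the overflowed case is my own addition; it isn't shown in the original:

#include <climits>

bool is_anagram(const HashedString &s1, const HashedString &s2)
{
    if (s1.hash != s2.hash) return false;   // different hashes: definitely not anagrams
    if ((s1.hash & 1) != 0) return true;    // odd hash: the product didn't overflow, so it's conclusive
    if (s1.length() != s2.length()) return false;

    // Both hashes overflowed to the same (inexact) value, so do an exact letter count.
    int counts[256] = { 0 };
    for (size_t i = 0; i < s1.length(); i++) counts[(unsigned char)s1[i]]++;
    for (size_t i = 0; i < s2.length(); i++) counts[(unsigned char)s2[i]]--;
    for (int i = 0; i < 256; i++)
        if (counts[i] != 0) return false;
    return true;
}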
Hacking Mats Petersson's test code to read from a dictionary, and trying this and the other algorithms on real English dictionary input, we see roughly a factor of four improvement over the next best algorithm (it was a factor of ten, but I tweaked the other code):
Function is_anagram  time: 218.9s hits: 93
Function is_anagram1 time: 200s   hits: 93
Function is_anagram2 time: 40s    hits: 93
Function is_anagram3 time: 7.74s  hits: 93
Function is_anagram4 time: 2.65s  hits: 93
Function is_anagram5 time: 2.3s   hits: 166
Function is_anagram6 time: 0.56s  hits: 166   <- This method
(the number of hits is different because the last two algorithms are both case-insensitive and my dictionary probably included acronyms colliding with natural words)
Update: although it's not what was asked, it was negligent of me not to point this out. I don't know if I didn't spot it, got sick of typing, or didn't want to make assumptions about the calling code...
Once you've hashed all the words, a trivial way to minimise the number of tests is to sort the list by that hash. Then you can limit comparisons to the parts of the list that are likely matches (according to the hash). This also makes branch prediction more efficient.
I just tried changing the N^2 iteration I tested with originally (I'm sure that was deliberately inefficient) to iterate over neighbours in a sorted list. The sort() call dominated the timing, but the whole pass was 200x faster than the fastest N^2 test, and the choice of comparison algorithm no longer made any meaningful difference to performance.
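For concreteness, a sketch of that sorted-neighbours pass; HashedString and is_anagram are the hypothetical pieces from above, and report() is a placeholder for whatever the calling code does with a match:

#include <algorithm>
#include <vector>

void report(const HashedString &a, const HashedString &b);   // placeholder

void find_all_anagrams(std::vector<HashedString> &words)
{
    // Sort by hash so that candidate anagrams end up adjacent.
    std::sort(words.begin(), words.end(),
              [](const HashedString &a, const HashedString &b) { return a.hash < b.hash; });

    // Only compare entries within a run of equal hashes.
    for (size_t i = 0; i < words.size(); i++)
        for (size_t j = i + 1; j < words.size() && words[j].hash == words[i].hash; j++)
            if (is_anagram(words[i], words[j]))
                report(words[i], words[j]);
}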
Or you could just bin words by hash as you receive them.
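A sketch of that binning variant too, again with the hypothetical pieces from above, keyed on the hash with a standard container:

#include <unordered_map>
#include <vector>

std::unordered_map<unsigned long, std::vector<HashedString>> bins;

void add_word(const HashedString &w)
{
    std::vector<HashedString> &bin = bins[w.hash];   // every earlier word with the same hash
    for (size_t i = 0; i < bin.size(); i++)
        if (is_anagram(bin[i], w))
            report(bin[i], w);   // same placeholder as before
    bin.push_back(w);
}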