running speed of permutation function using different methods results in unexpected results

Question

I have implemented a isPermutation function which given two string will return true if the two are permutation of each other, otherwise it will return false.

One uses c++ sort algorithm twice, while the other uses an array of ints to keep track of string count.

I ran the code several times and every time the sorting method is faster. Is my array implementation wrong?

Here is the output:

1
0
1
Time: 0.088 ms
1
0
1
Time: 0.014 ms

And the code:

#include <iostream> // cout
#include <string>   // string
#include <cstring> // memset
#include <algorithm> // sort
#include <ctime> // clock_t

using namespace std;

#define MAX_CHAR 255


void PrintTimeDiff(clock_t start, clock_t end) {
    std::cout << "Time: " << (end - start) / (double)(CLOCKS_PER_SEC / 1000) << " ms" << std::endl;
}


// using array to keep a count of used chars
bool isPermutation(string inputa, string inputb) {
    int allChars[MAX_CHAR];
    memset(allChars, 0, sizeof(int) * MAX_CHAR);

    for(int i=0; i < inputa.size(); i++) {
        allChars[(int)inputa[i]]++;
    }

    for (int i=0; i < inputb.size(); i++) {
        allChars[(int)inputb[i]]--;
        if(allChars[(int)inputb[i]] < 0) {
            return false;
        }   
    }

    return true;
}


// using sorting anc comparing
bool isPermutation_sort(string inputa, string inputb) {

    std::sort(inputa.begin(), inputa.end());
    std::sort(inputb.begin(), inputb.end());

    if(inputa == inputb) return true;
    return false;
}



int main(int argc, char* argv[]) {

    clock_t  start = clock();
    cout << isPermutation("god", "dog") << endl;
    cout << isPermutation("thisisaratherlongerinput","thisisarathershorterinput") << endl;
    cout << isPermutation("armen", "ramen") << endl;
    PrintTimeDiff(start, clock());


    start = clock();
    cout << isPermutation_sort("god", "dog") << endl;
    cout << isPermutation_sort("thisisaratherlongerinput","thisisarathershorterinput") << endl;
    cout << isPermutation_sort("armen", "ramen") << endl;
    PrintTimeDiff(start, clock());

    return 0;
}

With your first function, would isPermutation("aa", "a"); return true? perhaps you should check if their size are equal? — matt, Mar 14 '16 at 06:09
also, what results do you get if you flip the order you test the functions? testing isPermutation_sort first? — matt, Mar 14 '16 at 06:16
Why would you consider your array implementation broken? By what definition of broken? — Ulrich Eckhardt, Mar 14 '16 at 06:55
Your performance test is invalid, as the two functions produce different results: the `sort` one checks if two strings are permutations of each other, the `array` - only that the first string contains all characters of the second (hint: it may have other characters and still return `true`). — Vlad Feinstein, Mar 15 '16 at 14:05
@matt, that is true it would return true even though it's not a permutation. so in reality the array version is incorrect — ArmenB, Apr 16 '16 at 05:14
@UlrichEckhardt, using array it should run faster because it's not performing any sorting function twice. so array method is really O(n) and the sorting method most likely something like O(nlog(n)) — ArmenB, Apr 16 '16 at 05:17
Add that to your question to clarify it. However, your assumption is flawed: O(n) is not faster than O(n log n)! The only claim is that for all values of n greater than some minimum, one is faster than the other, but for values less than this limit no claims are made. With your tiny strings, you don't prove anything. Instead, auto-generate strings with larger sizes and random content. That way, you could also write tests to locate and fix the remaining bugs in your array-based function. — Ulrich Eckhardt, Apr 16 '16 at 10:07

score 5 · Answer 1 · answered Mar 14 '16 at 06:45

To benchmark this you have to eliminate all the noise you can. The easiest way to do this is to wrap it in a loop that repeats the call to each 1000 times or so, then only spit out the value every 10 iterations. This way they each have a similar caching profile. Throw away values that are bogus due (eg blowouts due to context switches by the OS).

I got your method marginally faster by doing this. An excerpt.

method 1 array Time: 0.768 us
method 2 sort Time: 0.840333 us

method 1 array Time: 0.621333 us
method 2 sort Time: 0.774 us

method 1 array Time: 0.769 us
method 2 sort Time: 0.856333 us

method 1 array Time: 0.766 us
method 2 sort Time: 0.850333 us

method 1 array Time: 0.802667 us
method 2 sort Time: 0.89 us

method 1 array Time: 0.778 us
method 2 sort Time: 0.841333 us

I used rdtsc which works better for me on this system. 3000 cycles per microsecond is close enough for this, but please do make it more accurate if you care about precision of the readings.

#if defined(__x86_64__)
static uint64_t rdtsc()
{
    uint64_t    hi, lo;

    __asm__ __volatile__ (
                            "xor %%eax, %%eax\n"
                            "cpuid\n"
                            "rdtsc\n"
                            : "=a"(lo), "=d"(hi)
                            :: "ebx", "ecx");

    return (hi << 32)|lo;
}
#else
#error wrong architecture - implement me
#endif

void PrintTimeDiff(uint64_t start, uint64_t end) {
    std::cout << "Time: " << (end - start)/double(3000)  << " us" << std::endl;
}

Each time I see something like this I feel respect to people who write this type of code for their wizardry. — Patryk, Mar 14 '16 at 10:00
I am reapplying to get my bachelors in CS again - right now. — ArmenB, Mar 14 '16 at 23:11
That `rdtsc` function will work correctly for 32bit, too, so you should add `|| defined(__i386__)` — Peter Cordes, Mar 16 '16 at 20:00

score 4 · Accepted Answer · edited May 23 '17 at 12:07

you cannot check performance differences between implementations putting in the mix calls to std::cout. isPermutation and isPermutation_sort are some order of magnitude faster than a call to std::cout (and, anyway, prefer \n over std::endl).
for testing you have to activate compiler optimizations. Doing so the compiler will apply the loop-invariant code motion optimization and you'll probably get the same results for both implementations.

A more effective way of testing is:

int main()
{
  const std::vector<std::string> bag
  {
    "god", "dog", "thisisaratherlongerinput", "thisisarathershorterinput",
    "armen", "ramen"
  };

  static std::mt19937 engine;
  std::uniform_int_distribution<std::size_t> rand(0, bag.size() - 1);

  const unsigned stop = 1000000;

  unsigned counter = 0;
  std::clock_t start = std::clock();
  for (unsigned i(0); i < stop; ++i)
    counter += isPermutation(bag[rand(engine)], bag[rand(engine)]);

  std::cout << counter << '\n';
  PrintTimeDiff(start, clock());

  counter = 0;
  start = std::clock();
  for (unsigned i(0); i < stop; ++i)
    counter += isPermutation_sort(bag[rand(engine)], bag[rand(engine)]);

  std::cout << counter << '\n';
  PrintTimeDiff(start, clock());

  return 0;
}

I have 2.4s for isPermutations_sort vs 2s for isPermutation (somewhat similar to Hal's results). Same with g++ and clang++.

Printing the value of counter has the double benefit of:

triggering the as-if rule (the compiler cannot remove the for loops);
allowing a first check of your implementations (the two values cannot be too distant).

There're some things you have to change in your implementation of isPermutation:

pass arguments as const references
```
bool isPermutation(const std::string &inputa, const std::string &inputb)
```
just this change brings the time down to 0.8s (of course you cannot do the same with isPermutation_sort).
you can use std::array and std::fill instead of memset (this is C++ :-)
avoid premature pessimization and prefer preincrement. Only use postincrement if you're going to use the original value
do not mix signed and unsigned value in the for loops (inputa.size() and i). i should be declared as std::size_t
even better, use the range based for loop.

So something like:

bool isPermutation(const std::string &inputa, const std::string &inputb)
{
  std::array<int, MAX_CHAR> allChars;
  allChars.fill(0);

  for (auto c : inputa)
    ++allChars[(unsigned char)c];

  for (auto c : inputb)
  {
    --allChars[(unsigned char)c];
    if (allChars[(unsigned char)c] < 0)
      return false;
  }

  return true;
}

Anyway both isPermutation and isPermutation_sort should have this preliminary check:

  if (inputa.length() != inputb.length())
    return false;

Now we are at 0.55s for isPermutation vs 1.1s for isPermutation_sort.

Last but not least consider std::is_permutation:

for (unsigned i(0); i < stop; ++i)
{
  const std::string &s1(bag[rand(engine)]), &s2(bag[rand(engine)]);

  counter += std::is_permutation(s1.begin(), s1.end(), s2.begin());
}

(0.6s)

EDIT

As observed in BeyelerStudios' comment Mersenne-Twister is too much in this case.

You can change the engine to a simpler one.:

static std::linear_congruential_engine<std::uint_fast32_t, 48271, 0, 2147483647> engine;

This further lowers the timings. Luckily the relative speeds remain the same.

Just to be sure I've also checked with a non random access scheme obtaining the same relative results.

consider the overhead of evaluating a Mersenne-Twister twice per evaluation: for `std::is_permutation` it can be more than 200% ([ideone.com](http://ideone.com/4Rpcwe): 0.17 vs 0.05 sec) — BeyelerStudios, Mar 14 '16 at 10:29
@BeyelerStudios Thank you for your remark. It's true Mersenne-Twister is too complex and a `std::linear_congruential_engine` is enough. But even so there're two calls to `rand(engine)` for both `std::is_permutation` and `isPermutation` and that's fair. I've edited the answer. — manlio, Mar 14 '16 at 10:58
@BeyelerStudios ...and `s1` and `s2` had to be simple const references... now fixed. — manlio, Mar 14 '16 at 11:05
You missed one of the OP's bugs: `++allChars[c];` uses `c` as a signed index. (`char` happens to be signed in the Linux x86-64 ABI, but of course any code which depends on that one way or another is non-portable). See my answer. Casting to `(unsigned char)` is necessary. Also, do you have any insight into why clang fails to optimize away some of the overhead of `string::operator[]` in the OP's version, but it's fine in your version with `for (auto c : inputa)`? — Peter Cordes, Mar 16 '16 at 21:42
@PeterCordes Thank you very much, you're completely right. Now it should be fixed. — manlio, Mar 16 '16 at 22:34

score 3 · Answer 3 · edited Jun 20 '20 at 09:12

Your idea amounts to using a Counting Sort on both strings, but with the comparison happening on the count array, rather than after writing out sorted strings.

It works well because a byte can only have one of 255 non-zero values. Zeroing 256B of memory, or even 4*256B, is pretty cheap, so it works well even for fairly short strings, where most of the count array isn't touched.

It should be fairly good for very long strings, at least in some cases. It's pretty heavily dependent on a good and a heavily pipelined L1 cache, because scattered increments to the count array produces scattered read-modify-writes. Repeated occurrences create a dependency chain of with a store-load round-trip in it. This is a big glass-jaw for this algorithm, on CPUs where many loads and stores can be in flight at once (with their latencies happening in parallel). Modern x86 CPUs should run it pretty well, since they can sustain a load + store every clock cycle.

The initial count of inputa compiles to a very tight loop:

.L15:
        movsx   rdx, BYTE PTR [rax]
        add     rax, 1
        add     DWORD PTR [rsp-120+rdx*4], 1
        cmp     rax, rcx
        jne     .L15

This brings us to the first major bug in your code: char can be signed or unsigned. In the x86-64 ABI, char is signed, so allChars[(int)inputa[i]]++; sign-extends it for use as an array index. (movsx instead of movzx). Your code will write outside the array bounds on non-ASCII characters that have their high bit set. So you should have written allChars[(unsigned char)inputa[i]]++;. Note that casting to (unsigned) doesn't give the result we want (see comments).

Note that clang makes much worse code (v3.7.1 and v3.8, both with -O3), with a function call to std::basic_string<...>::_M_leak_hard() inside the inner loop. (Leak as in leak a reference, I think.) @manlio's version doesn't have this problem, so I guess for (auto c : inputa) syntax helps clang figure out what's happening.

Also, using std::string when your callers have char[] forces them to construct a std::string. That's kind of silly, but it is helpful to be able to compare string lengths.

GNU libc's `std::is_permutation` uses a very different strategy:

First, it skips any common prefix that's identical without permutation in both strings.

Then, for each element in inputa:

count the occurrences of that element in inputb. Check that it matches the count in inputa.

There are a couple optimizations:

Only compare counts the first time an element is seen: find duplicates by searching from the beginning of inputa, and if the match position isn't the current position, we've already checked this element.
check that the match count in inputb is != 0 before counting matches in the rest of inputa.

This doesn't need any temporary storage, so it can work when the elements are large. (e.g. an array of int64_t, or an array of structs).

If there is a mismatch, this is likely to find it early, before doing as much work. There are probably a few cases of inputs where the counting version would take less time, but probably for most inputs the library algorithm is best.

std::is_permutation uses std::count, which should be implemented very well with SSE / AVX vectors. Unfortunately, it's auto-vectorized in a really stupid way by both gcc and clang. It unpacks bytes to 64bit integers before accumulating them in vector elements, to avoid overflow. So it spends most of its instructions shuffling data around, and is probably slower than a scalar implementation (which you'd get from compiling with -O2, or with -O3 -fno-tree-vectorize).

It could and should only do this every few iterations, so the inner loop of count can just be something like pcmpeqb / psubb, with a psadbw every 255 iterations. Or pcmpeqb / pmovmskb / popcnt / add, but that's slower.

Template specializations in the library could help a lot for std::count for 8, 16, and 32bit types whose equality can be checked with bitwise equality (integer ==).

Very interesting analysis. Unfortunately there isn't a +2 button. — manlio, Mar 16 '16 at 22:37
@sign-extension: *[...] the resulting value is the smallest unsigned value **equal to the source value modulo 2^n** [...] That is [for two's complement] signed integers are sign-extended [...]* - [cppreference](http://en.cppreference.com/w/cpp/language/implicit_cast#Numeric_conversions). — BeyelerStudios, Mar 19 '16 at 06:02
@BeyelerStudios: Ahhh, I see. So casting back to a `signed int` would give the same integer that the `char` originally had. Of course it works that way; it would actually be weird if `(unsigned) some_char` was the same as `(unsigned) (unsigned char)some_char`. — Peter Cordes, Mar 19 '16 at 06:12

running speed of permutation function using different methods results in unexpected results

3 Answers3

GNU libc's std::is_permutation uses a very different strategy:

GNU libc's `std::is_permutation` uses a very different strategy: