
In Column 2 of the book Programming Pearls there is a problem asking you to design an algorithm that rotates a string k positions to the left. For example, if the string is "12345" and k=2, the result is "34512".

The first algorithm simulates the exchange process directly, i.e. it moves x[(i + k) % n] into x[i], following each cycle until every element is in place.

The second algorithm uses the observation that we only need to exchange a="12" and b="345", i.e. the first k characters and the last n - k characters. We can first reverse a to a'="21" and b to b'="543", then reverse the whole string a'b' to obtain ba, which is the desired result.

Following is my code:

Algorithm 1:

#include <stdio.h>
#include <stdlib.h>

#define NEXT(j) (((j) + k) % n)
#define PREV(j) (((j) + n - k) % n)

int gcd(int a, int b) {
    return (a % b == 0 ? b : gcd(b, a % b));
}

/* Juggling rotation: move elements along the gcd(n, k) disjoint cycles. */
void solve(int *a, int n, int k) {
    int len = gcd(n, k);        /* number of independent cycles */
    for (int i = 0; i < len; i++) {
        int x = a[i];           /* save the first element of the cycle */
        int j = i;
        do {
            a[j] = a[NEXT(j)];  /* pull the element k positions ahead */
            j = NEXT(j);
        } while (j != i);
        a[PREV(j)] = x;         /* the last slot receives the saved element */
    }
}
}

int main(void)
{
    int n, k;
    if (scanf("%d %d", &n, &k) != 2) return 1;

    int *a = malloc(sizeof(int) * n);
    if (a == NULL) return 1;
    for (int i = 0; i < n; i++) a[i] = i;

    solve(a, n, k);

    free(a);

    return 0;
}

Algorithm 2:

#include <stdio.h>
#include <stdlib.h>

void swap(int *a, int *b) {
    int t = *a;
    *a = *b;
    *b = t;
}

/* Reverse a[0..n-1] in place. */
void reverse(int *a, int n) {
    int m = n / 2;
    for (int i = 0; i < m; i++) {
        swap(a + i, a + (n - 1 - i));
    }
}

/* Reversal rotation: reverse the two parts, then reverse the whole array. */
void solve(int *a, int n, int k) {
    reverse(a, k);          /* "12"    -> "21"    */
    reverse(a + k, n - k);  /* "345"   -> "543"   */
    reverse(a, n);          /* "21543" -> "34512" */
}

int main(void)
{
    int n, k;
    if (scanf("%d %d", &n, &k) != 2) return 1;

    int *a = malloc(sizeof(int) * n);
    if (a == NULL) return 1;
    for (int i = 0; i < n; i++) a[i] = i;

    solve(a, n, k);

    free(a);

    return 0;
}

where n is the length of the string and k is the number of positions to rotate.

I use n=232830359 and k=80829 to test the two algorithms. The result is, algorithm 1 takes 6.199s while algorithm 2 takes 1.970s.

However, I think both algorithms need about n element moves. (For algorithm 1 this is obvious; algorithm 2 performs k/2 + (n-k)/2 + n/2 = n swaps.)

My question is: why do their speeds differ so much?

HanXu
    It could be cache misses. The second version accesses memory in a nice linear fashion while the first jumps all around. If you want to investigate more you can look at some of the ideas here: http://stackoverflow.com/questions/2486840/linux-c-how-to-profile-time-wasted-due-to-cache-misses. – mrmcgreg Aug 16 '14 at 15:06

1 Answer


Both of these algorithms are more memory-bound than CPU-bound. That is why analyzing the number of basic operations (like swaps or loop iterations) gives results that are quite different from the real running time. So we will use the external memory model instead of the RAM model; that is, we will analyze the number of cache misses. Let's say that N is the array size, M is the number of blocks in the cache, and B is the size of one block. Since N is big in your test, it is safe to assume that N > M * B (that is, the whole array cannot fit in the cache).

1) The first algorithm accesses array elements in the following order: i, (i + k) mod N, (i + 2k) mod N, and so on. If k is large, two consecutively accessed elements are not in the same block, so in the worst case two accesses yield two cache misses. These two blocks are loaded into the cache, but they might not be used again for a long time! So when they are accessed again, they may already have been replaced by other blocks (because the cache is smaller than the array), and it will be a miss again. It can be shown that this algorithm can incur O(N) cache misses in the worst case.

2) The second algorithm has a very different access pattern: l, r, l + 1, r - 1, .... If accessing the l-th element causes a miss, the entire block containing it is loaded into the cache, so accesses to l + 1, l + 2, ... up to the end of the block cause no further misses. The same holds for r, r - 1, and so on. (Strictly speaking, this is true only if the blocks at l and r can be held in cache at the same time, but that is a safe assumption because caches are usually not direct-mapped.) So this algorithm incurs O(N / B) cache misses in the worst case.

Taking into account that the block size of a real cache is much larger than the size of one integer, it becomes clear why the second algorithm is significantly faster.

P.S. This is just a model of what is really going on, but in this particular case the external memory model works better than the RAM model (and the RAM model is just a model too, anyway).

kraskevich