1

I have a longer code that was cut down as much as possible while keeping the issue alive. My code runs an MCMC computation for different parameter values. For some combinations of values, the code runs much longer, about 100x slower than on typical cases. However, it shouldn't, because the number of operations does not depend on parameter values.

I am running this on an AMD64 Linux box with glibc-2.17 compiled with GCC 4.8.1 on Gentoo. Compile flags do not matter as it appears. I also tested it on a different Gentoo box with an older AMD64 processor and the results were the same.

There were a bunch of tests I did:

  • I have tried debugging with Valgrind and it found no memory issues or other nasty things.
  • Secondly, I have tried running the code with the problematic parameter values fixed, but the slowness was not encountered.
  • I have tried putting sleep(4) between iterations, but nothing changed.

The problem shows when the iteration hits k = 1, i = j = 0, which translates into mu[0] = -0.05, mu[1] = -0.05 and mu[2] = 0.05. As I said, using this fixed value for all iterations eliminates the issue I am seeing.

Here are things that eliminate the problem:

  • Changing the limits.
  • Fixing the mu[] coefficients.
  • Removing dW3 from the computation.
  • Removing rand().
  • Removing computation of q.
  • Removing the update of s[j].

I have read quit a bit about slowpow, thus tried eliminating exp by writing my own version of it. This solves the problems I am having with this MWE, but not when the reimplemented exp is placed in the production code.

Question: what is causing the semi-random slowness?

Code of the MWE follows. All help and suggestions as how to proceed would be greatly appreciated.

Note: This code is compiled with g++ although it is essentially C. Changing compiler doesn't change anything.

Regarding branch prediction: Removing one of the if statements by using

q = exp(dW);        
q = q / (1.0 + q); 

no matter what is the value of dW doesn't change code's behavior; if this indeed is due to branch prediction, it would have to be due to the second if.


#include <cstdio>
#include <cstdlib>
#include <cmath>

inline int index(int const i, int const j, int const n)
{
    return (i + n) % n + ((j + n) % n) * n;
}

void get_sample(int* s, int n, double* mu)
{
    for (int i = 0; i < 10 * n * n; i++)
    {
        int j = i % (n * n); 
        int x = j % n;
        int y = (j - x) / n;

        double dW1 = mu[0] * (s[index(x - 1, y, n)] + s[index(x + 1, y, n)] + s[index(x, y - 1, n)] + s[index(x, y + 1, n)]);
        double dW2 = mu[1] * (s[index(x - 1, y - 1, n)] + s[index(x + 1, y - 1, n)] + s[index(x + 1, y + 1, n)] + s[index(x - 1, y + 1, n)]);
        double dW3 = mu[2] * (s[index(x - 1, y, n)] * s[index(x - 1, y - 1, n)] * s[index(x, y - 1, n)] + s[index(x - 1, y, n)] * s[index(x - 1, y + 1, n)] * s[index(x, y + 1, n)]
                                        + s[index(x, y + 1, n)] * s[index(x + 1, y + 1, n)] * s[index(x + 1, y, n)] + s[index(x + 1, y, n)] * s[index(x + 1, y - 1, n)] * s[index(x, y - 1, n)]);

        double dW = 2.0 * (dW1 + dW2 + dW3);

        double q;
        if (dW < 0.0)
        {   
            q = exp(dW);

            q = q / (1.0 + q); 
        }
        else
        {
            q = exp(-dW);

            q = 1.0 / (1.0 + q);
        } 

        double p = ((double) rand()) / ((double) RAND_MAX);

        if (p < q)
        {
            s[j] = 1;
        }
        else
        {
            s[j] = -1;
        }
    }
}

int main(int argc, char** argv)
{
    double mu[3];

    double limits[6] = {-0.05, 0.8, -0.05, 0.45, -0.45, 0.05};

    int s[16];

    for (int i = 0; i < 16; i++)
    {
        s[i] = -1;
    }

    for (int k = 0; k < 2; k++)
    {
        for (int j = 0; j < 2; j++)
        {
            for (int i = 0; i < 2; i++)
            {               
                mu[0] = limits[0] + ((limits[1] - limits[0]) * i);
                mu[1] = limits[2] + ((limits[3] - limits[2]) * j);
                mu[2] = limits[4] + ((limits[5] - limits[4]) * k);

                printf(" Computing (% .6lf, % .6lf, % .6lf)...\n", mu[0], mu[1], mu[2]);

                for (int sample = 0; sample < 1000; sample++)
                {
                    get_sample(s, 4, mu);
                }                           
            }
        }
    }               

    return 0;
}
ANSI C Mastah
  • 149
  • 1
  • 6
  • why call calloc() when you set each value to -1 straight after? – Mitch Wheat Nov 18 '13 at 01:39
  • 1
    BTW: having 'Master' in your name (even with the 'gangsta' spelling) might prejudice people from answering. – Mitch Wheat Nov 18 '13 at 01:41
  • `calloc()` is a result of this code coming from production. Changing.. – ANSI C Mastah Nov 18 '13 at 01:41
  • @MitchWheat: it's a joke, hence the 'gangsta' spelling. – ANSI C Mastah Nov 18 '13 at 01:43
  • Is it more consistent speed-wise if you comment out the if statements (or make them if(1) instead of the real condition)? Could be branch prediction hitting a pathological case; the things you remove to solve it seem related to the if()s too.... – Adam D. Ruppe Nov 18 '13 at 01:46
  • Since `get_sample` has two if/else, could [branch prediction](http://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-an-unsorted-array/11227902#11227902) be responsible for the difference? – Christian Garbin Nov 18 '13 at 01:49
  • Yes, replacing `if(...)` with `if(1)` does eliminate this issue, I can confirm that. However, why using fixed `mu`, rather than iterating through values, would cause problems? – ANSI C Mastah Nov 18 '13 at 01:54
  • Is it something to do with cache locality? Are you hitting the arrays in a worse sequence, and incurring performance hits as a result? It isn't at all clear to me whether the array access pattern varies as a result of the parameters, so I don't expect this to be the answer, but I thought it worth mentioning as something for you to check. – Jonathan Leffler Nov 18 '13 at 02:03
  • @ANSICMastah Please read about https://www.google.de/search?q=branch+prediction – j_kubik Nov 18 '13 at 02:06
  • Array is very small (16 integers), so not sure if cache is an issue. The access pattern is pre-determined, actually, so is no longer random: you get `j = i % 16`, so you go along the array, and you visit the entire array 10 times. The `index` simply tells the code to access values that surround the current one on a 4x4 periodic lattice. – ANSI C Mastah Nov 18 '13 at 02:06
  • If all the contents of s[] are the same, without the mu changing, the values being compared always give the same result too (if I'm reading the code right at least), which is again friendly to the branch predictor. – Adam D. Ruppe Nov 18 '13 at 02:12
  • @AdamD.Ruppe: contents of `s[]` change due to the `if` at the end of `get_sample`. In fact, when `mu` are small, they change a lot. – ANSI C Mastah Nov 18 '13 at 02:18

2 Answers2

3

However, it shouldn't, because the number of operations does not depend on parameter values.

The speed of floating point operations does depend on parameter values, though. If you introduce NaN or other exceptional values in your computation (which I didn't review the code for) it will drastically degrade your floating point performance.

EDIT: I manually profiled (with simple rdtsc counting) around the exp() and it was easy to bin "good" and "bad" cases. When I printed the bad cases it was all where dW ~= 0. If you split that case out you get even performance:

    double q;
    if (dW < -0.1e-15)
    {
        q = exp(dW);

        q = q / (1.0 + q);
    }
    else if (dW > 0.1e-15)
    {
        q = exp(-dW);

        q = 1.0 / (1.0 + q);
    }
    else
    {
        q = 0.5;
    }
Ben Jackson
  • 90,079
  • 9
  • 98
  • 150
  • Yes, but there are no pathological numbers as far as I have found. `std::isfinite` never reported problematic values, though I am aware that [certain values do cause slowness](http://entropymine.com/imageworsener/slowpow/). Secondly, fixing the problematic `mu` parameters throughout the iteration removes the problem. – ANSI C Mastah Nov 18 '13 at 01:57
  • @ANSICMastah: I think the next step is instruction level profiling with something like VTune or oprofile. That would narrow it down to the exact operation. Unfortunately the only Linux system I have easy access to is a VM and I can't get at the performance counters. – Ben Jackson Nov 18 '13 at 02:32
  • I tried using `gprof`, but sadly it doesn't tell me which parts are slow during those abnormal iterations. – ANSI C Mastah Nov 18 '13 at 02:50
  • Good catch, thanks! What is troubling is the fact that fixing `mu` during the iteration didn't show this. – ANSI C Mastah Nov 18 '13 at 02:59
  • 1
    @ANSICMastah Try looking for denormal values. Very small numbers are one of the usual suspects. It's likely you can trap them and use a linear or constant approximation to the overall function. – Potatoswatter Nov 18 '13 at 03:09
0

If I am right and branch prediction is the problem, you should try

void get_sample(int* s, int n, double* mu)
{
    for (int i = 0; i < 10 * n * n; i++)
    {
        int j = i % (n * n); 
        int x = j % n;
        int y = (j - x) / n;

        double dW1 = mu[0] * (s[index(x - 1, y, n)] + s[index(x + 1, y, n)] + s[index(x, y - 1, n)] + s[index(x, y + 1, n)]);
        double dW2 = mu[1] * (s[index(x - 1, y - 1, n)] + s[index(x + 1, y - 1, n)] + s[index(x + 1, y + 1, n)] + s[index(x - 1, y + 1, n)]);
        double dW3 = mu[2] * (s[index(x - 1, y, n)] * s[index(x - 1, y - 1, n)] * s[index(x, y - 1, n)] + s[index(x - 1, y, n)] * s[index(x - 1, y + 1, n)] * s[index(x, y + 1, n)]
                                        + s[index(x, y + 1, n)] * s[index(x + 1, y + 1, n)] * s[index(x + 1, y, n)] + s[index(x + 1, y, n)] * s[index(x + 1, y - 1, n)] * s[index(x, y - 1, n)]);

        double dW = 2.0 * (dW1 + dW2 + dW3);

        double q;
        q = exp(dW *((dW>0)*2-1);

        q = ((dW>0)*q + (dW<=0)) / (1.0 + q); 

        double p = ((double) rand()) / ((double) RAND_MAX);

        s[j] = (p<q)*2-1;
    }
}

I am also wondering if a good compiler shouldn't make such transformations anyway...

j_kubik
  • 6,062
  • 1
  • 23
  • 42
  • Moments before your answer I eliminated both `if` statements, the latter in the same way as you did. Nothing has changed, interestingly enough. – ANSI C Mastah Nov 18 '13 at 02:22
  • BTW - are you timing optimized version of your code? What compiler do you use? – j_kubik Nov 18 '13 at 02:24
  • In my experience, people expect a lot more optimizations out of their compilers than compilers generally provide. – Crashworks Nov 18 '13 at 02:25
  • @Crashworks - True enough, but optimization is not supposed to be better than we expect, it should just as we would expect. Code produced without it is usually full of pointless instructions. – j_kubik Nov 18 '13 at 02:28
  • @j_kubik: As I wrote, GCC 4.8.1; I did use -O3, -O2, no optimizations, the result is always the same. I used `gettimeofday` to time it initially, but it is actually clearly visible just from the `printf`'s. – ANSI C Mastah Nov 18 '13 at 02:29
  • I timed your code (I increased your sample count to 100000, for default 1000 your program was just flashing) and for my machine worst case you described was only slightly slower - about 10% from the rest. After a moment of experimentation I noticed a funny thing - Although execution times for each mu parameter set are spread a bit (which at first suggests system interrupting process randomly being the cause), the exact same spread is observed on every program run. – j_kubik Nov 18 '13 at 03:07