
My question is simple: what is the most efficient way to normalize a 2D double array in C, so that its columns (or its rows) sum to 1? Below is a simple example to illustrate what I want to do; it only handles the first column. It's OK to reshape the array to normalize either its rows or its columns, if that matters. I am also open to using an external C linear algebra library; I'm just not sure where to start. Thanks for any help in advance!

/* Normalizes the first column of array in place so its entries sum to 1. */
void normalize(double** array, int nrow) {
    int i;
    double sum = 0;
    for(i = 0; i < nrow; i++) {
        sum += array[i][0];
    }
    for(i = 0; i < nrow; i++) {
        array[i][0] /= sum;
    }
}
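
For illustration, here is the same pattern extended to every column (the ncol parameter and the helper name are added here just to make the example self-contained):

/* Sketch: normalize each column of array so it sums to 1. */
void normalize_all(double** array, int nrow, int ncol) {
    int i, j;
    for(j = 0; j < ncol; j++) {
        double sum = 0;
        /* accumulate column j, then rescale it */
        for(i = 0; i < nrow; i++) {
            sum += array[i][j];
        }
        for(i = 0; i < nrow; i++) {
            array[i][j] /= sum;
        }
    }
}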

By the way, this is part of a hidden Markov model dynamic programming algorithm, and it is called many times, so I want to make this part as efficient as possible.

qkhhly
  • most efficient by what measure? What is wrong with the example given? If I can't prove that my answer is somehow the most efficient possible ever (you don't even give the platform and architecture so I don't know how someone could possibly know what is most efficient for you), do you not want it? – xaxxon Jul 23 '13 at 02:46
  • Well, I am just wondering if there is a better way to do it, as measured by elapsed time. The code will run on Linux on an Intel Xeon. – qkhhly Jul 23 '13 at 02:55
  • Store a hash of every possible combination and look up the normalized value. It takes an almost infinite amount of memory, of course. I mean, there are probably some SIMD instructions you could use, but is it worth going that far (also, that requires knowing WHICH Xeon platform)? What's your goal? Why do you think you need something faster? Have you profiled your code? – xaxxon Jul 23 '13 at 02:56
  • Sure, if there is no quick better way, it's probably not worth it. This piece of code takes about a fifth of the running time, and the other parts of the code are, I believe, harder to speed up. But if you have a large amount of data and you want to train the model through iterations (EM or MCMC), a fifth of the time is still a lot. – qkhhly Jul 23 '13 at 03:10
  • I'd start reading here: http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions. I'm guessing your second loop could benefit quite a bit from some SIMD love (see the SSE2 sketch after these comments), but I've never done anything like that. Also, it looks like the Intel compiler might do some of this for you automagically. – xaxxon Jul 23 '13 at 03:19
  • I'd recommend tagging your question with "simd" and see if anyone there can help you - but you'd have to put in a little effort first to try it and say what's not working – xaxxon Jul 23 '13 at 03:22
  • You may want to look at this question, as well: http://stackoverflow.com/questions/17761154/sse-reduction-of-float-vector – xaxxon Jul 23 '13 at 03:24
  • Thanks! I'll take a look. And also thanks for the reminders. I probably should not waste 80% of my time on the 20% improvement. :) – qkhhly Jul 23 '13 at 03:28
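
Following up on the SIMD suggestion above: a minimal sketch of the summation loop using SSE2 intrinsics over a contiguous buffer (the function name and the unaligned loads are illustrative assumptions, not something from the thread; gcc -O2/-O3 may auto-vectorize the plain loop just as well):

#include <emmintrin.h>  /* SSE2 */

/* Sum a contiguous array of doubles, two elements per iteration. */
double simd_sum(const double *a, int n) {
    __m128d acc = _mm_setzero_pd();
    int i;
    for(i = 0; i + 2 <= n; i += 2) {
        acc = _mm_add_pd(acc, _mm_loadu_pd(a + i));
    }
    double partial[2];
    _mm_storeu_pd(partial, acc);
    double sum = partial[0] + partial[1];
    for(; i < n; i++) {   /* scalar tail for odd n */
        sum += a[i];
    }
    return sum;
}

Whether this actually beats the compiler's own output should be confirmed by profiling, as the comments suggest.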

1 Answer


I don't know how closely your sample code matches your application, but if you are looping over rows like that, you are almost certainly running into cache problems. If I code your loops in row-major and column-major order, I see drastic performance differences.

With nrow=1000000 and ncol=1000, if I use array[i][0], I get a runtime of about 1.9 s. If I use array[0][i], it drops to about 0.05 s.

If it's possible for you to transpose your data in this way, you should see a large performance boost.

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int i;
    int nrow = 1000000, ncol = 1000;
    double sum = 0;
    double **array;

#ifdef COL_MAJOR

    /* One malloc per row: summing column 0 touches nrow separate
       allocations, so almost every access hits a new cache line. */
    array = (double **)malloc(nrow * sizeof(double *));
    for(i = 0; i < nrow; i++) {
        array[i] = (double *)malloc(ncol * sizeof(double));
        array[i][0] = i;
    }

    for(i = 0; i < nrow; i++) {
        sum += array[i][0];
    }
    for(i = 0; i < nrow; i++) {
        array[i][0] /= sum;
    }

#else

    /* Transposed layout: the values being normalized sit in one
       contiguous block of nrow doubles, so access is sequential. */
    array = (double **)malloc(ncol * sizeof(double *));
    for(i = 0; i < ncol; i++) {
        array[i] = (double *)malloc(nrow * sizeof(double));
    }
    for(i = 0; i < nrow; i++) {
        array[0][i] = i;
    }

    for(i = 0; i < nrow; i++) {
        sum += array[0][i];
    }
    for(i = 0; i < nrow; i++) {
        array[0][i] /= sum;
    }

#endif

    printf("%f\n", sum);
    return 0;
}
$ gcc -DCOL_MAJOR -O2 -o normed normed.c
$ time ./normed
499999500000.000000

real    0m1.904s
user    0m0.325s
sys 0m1.575s
$ time ./normed
499999500000.000000

real    0m1.874s
user    0m0.304s
sys 0m1.567s
$ time ./normed
499999500000.000000

real    0m1.873s
user    0m0.296s
sys 0m1.573s
$ gcc -O2 -o normed normed.c
$ time ./normed
499999500000.000000

real    0m0.051s
user    0m0.017s
sys 0m0.024s
$ time ./normed
499999500000.000000

real    0m0.050s
user    0m0.017s
sys 0m0.023s
$ time ./normed
499999500000.000000

real    0m0.051s
user    0m0.014s
sys 0m0.022s
$ 
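
If transposing the data is an option, the normalize() from the question reduces to a single pass over one contiguous buffer. A minimal sketch under that assumption (the helper name and the zero-sum guard are additions for illustration):

/* Normalize one contiguous distribution of n doubles in place. */
void normalize_contiguous(double *row, int n) {
    double sum = 0.0;
    int i;
    for(i = 0; i < n; i++) {
        sum += row[i];
    }
    if(sum == 0.0) return;   /* nothing sensible to do for an all-zero row */
    for(i = 0; i < n; i++) {
        row[i] /= sum;
    }
}

Since the question mentions being open to an external linear algebra library: with CBLAS, the scaling loop maps onto cblas_dscal(n, 1.0 / sum, row, 1), which many BLAS implementations vectorize internally.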
Nigel
  • I guessed there would be a difference between row normalization and column normalization, but thanks for the benchmark! – qkhhly Jul 23 '13 at 13:51