C++: Improve existing C program performance

Question

I have the following piece of code that gives the resulting time from multiplying 2 matrix of 1024x1024 fields:

#include <stdio.h> 
#include <stdlib.h> 
#include <time.h>

#define NUM 1024
float a[NUM][NUM],b[NUM][NUM],c[NUM][NUM]; 
void initialize_matrix(float m[][NUM]); 
void load_matrix(float m[][NUM]); 

int main() {

    int i,j,k; 

    clock_t t_inicial,t_final; 
    load_matrix(a); 
    load_matrix(b); 
    initialize_matrix(c); 

    printf("Starting matrix multiplication 1024x1024...\n\n");

    t_inicial=clock();
    for(i=0;i<NUM;i++) 
    for(j=0;j<NUM;j++) 
    for(k=0;k<NUM;k++) 
    c[i][j] =c[i][j] + a[i][k] * b[k][j]; 
    t_final=clock();

    printf("Matrix multiplication finished in: %3.6f seconds",((float) t_final- (float)t_inicial)/CLOCKS_PER_SEC);

} 

void initialize_matrix(float m[][NUM]) {

    int i,j;
    for(i=0;i<NUM;i++) 
    for(j=0;j<NUM;j++) 
    m[i][j]=0.0; 
    return;

} 
void load_matrix(float m[][NUM]) {

    int i,j;
    #pragma omp parallel for
    for(i=0;i<NUM;i++) 
    for(j=0;j<NUM;j++) 
    m[i][j]=(float) 10*rand()/(float) rand(); 
    return;

}

This code takes 24 seconds to solve it, I've been told that there's some problem with the cache memory. How can I improve this code so it takes less time? It's not very nice that it takes so long.

@Adriano that question was made 3 years ago and is not what I'm asking. — user3288852, May 05 '14 at 16:25
@user3288852 It doesn't really matter how old a question here is ... — πάντα ῥεῖ, May 05 '14 at 16:27
@user3288852 I don't think its content is outdated. You'll get a big speed-up even nowadays (3 years ago cache size wasn't around 1 KB). I did understand you want to improve performance of that code (as someone told you because of cache size). Moreover in answers they mention many other things you can do: loop unrolling made by compiler (matrix size is known at compile time) and even SIMD... — Adriano Repetti, May 05 '14 at 16:29
You totally changed question from matrix multiplication to calculation of value of pi? — Öö Tiib, May 05 '14 at 16:47
@user3288852: Please don't substantially change questions like that. if you have a different question, make a new post. — Mooing Duck, May 05 '14 at 16:58
@MooingDuck sorry, I wanted to delete it because I found the solution and it wasn't in the answers and had another doubt so I changed it. Apologies. — user3288852, May 05 '14 at 17:22

Massa · Accepted Answer · 2014-05-05T17:01:21.127

3

Just exchanging

for(j=0;j<NUM;j++) 
for(k=0;k<NUM;k++)

with

for(k=0;k<NUM;k++) 
for(j=0;j<NUM;j++)

I have had a 43x speedup. Like you said, improving cache locality.

Some milliseconds can even be shaven by blocking, i.e., exchanging

for(j=0;j<NUM;j++) 
  for(j=0;j<NUM;j++) 
    for(k=0;k<NUM;k++)

by

for(int i0=0; i0<NUM; i0+=BLK)
  for(int k0=0; k0<NUM; k0+=BLK)
    for(int j0=0; j0<NUM; j0+=BLK)
      for(int i=i0, ix=i0+BLK; i<ix; ++i)
        for(int k=k0, kx=k0+BLK; k<kx; ++k)
          for(int j=j0, jx=j0+BLK; j<jx; ++j)

(my best runs were with #define BLK 256, but YMMV).

CLARIFYING: this is the answer given the link referenced by @Adriano, and you really should have looked it before editing the question.

edited May 05 '14 at 17:01

answered May 05 '14 at 16:47

Massa

8,647
2
25
26

1

As an experiment, I wrapped the block iteration in an iterator, but I make no claims as to performance or readability: http://coliru.stacked-crooked.com/a/be0367ca298990a8 – Mooing Duck May 05 '14 at 18:08
1

Now generalized to N counters (currently topped at 6), with no template recursion (Altough the benefits of template recursion probably outweigh the cons): http://coliru.stacked-crooked.com/a/32a6e4aabce94f8e – Mooing Duck May 05 '14 at 20:12

score 1 · Answer 2 · edited May 23 '17 at 10:32

It depends upon the compiler that you are using. With GCC you could use __builtin_prefetch, of course after having asked for optimization (compiling with gcc -O3 -mtune=native). Use it carefully after benchmarking, see this answer (e.g. in the i or j loop, to fetch the next row).

Your code is very regular, so could be using OpenMP directives to the compiler. And you could even consider coding some OpenCL kernel to take advantage of your GPGPU.

score 1 · Answer 3 · answered May 05 '14 at 16:32

A straightforward implementation of matrix multiply is not very cache friendly. The comment that you were told is probably referring to blocking, which will do the multiplication in blocks to improve locality. Here is one reference. If you Google for "cache block matrix multiplication" you will get other hits.

C++: Improve existing C program performance

3 Answers3