
Has anyone ever seen any numbers/analysis on whether or not use of the C/C++ restrict keyword in gcc/g++ actually provides any significant performance boost in reality (and not just in theory)?

I've read various articles recommending or disparaging its use, but I haven't run across any real numbers practically demonstrating either side's arguments.

EDIT

I know that restrict is not officially part of C++, but it is supported by some compilers, and I've read a paper by Christer Ericson which strongly recommends its usage.
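
For reference, here is a minimal sketch of a portability wrapper for the compiler-specific spellings (the macro name MY_RESTRICT is made up for illustration; as far as I know GCC/Clang spell the extension __restrict__ and MSVC spells it __restrict):

/* Sketch of a portability wrapper; MY_RESTRICT is a made-up name. */
#if defined(__GNUC__)                   /* GCC and Clang, works in C and C++ */
#  define MY_RESTRICT __restrict__
#elif defined(_MSC_VER)                 /* MSVC */
#  define MY_RESTRICT __restrict
#elif defined(__STDC_VERSION__) && __STDC_VERSION__ >= 199901L
#  define MY_RESTRICT restrict          /* standard keyword since C99 */
#else
#  define MY_RESTRICT                   /* no-op fallback */
#endif

/* Example use: promise the compiler that dst and src never overlap. */
void scale(float *MY_RESTRICT dst, const float *MY_RESTRICT src, float k, int n)
{
  int i;
  for (i = 0; i < n; i++)
    dst[i] = k * src[i];
}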

Robert S. Barnes
  • Aliasing issues are commonly considered the #1 reason why C/C++ are less efficient at many computational tasks than Fortran. So I'd say any feature which helps avoid aliasing can make a *big* difference. – jalf Dec 27 '09 at 23:58
  • possible duplicate of [Realistic usage of the C99 'restrict' keyword?](http://stackoverflow.com/questions/745870/realistic-usage-of-the-c99-restrict-keyword) – Ciro Santilli OurBigBook.com Jun 14 '15 at 10:48

5 Answers


The restrict keyword makes a difference.

I've seen improvements of a factor of 2 and more in some situations (image processing). Most of the time the difference is not that large though, about 10%.

Here is a little example that illustrates the difference. I've written a very basic 4x4 vector * matrix transform as a test. Note that I have to force the function not to be inlined. Otherwise GCC detects that there aren't any aliasing pointers in my benchmark code and restrict wouldn't make a difference due to inlining.

I could have moved the transform function to a different file as well.

#include <math.h>

#ifdef USE_RESTRICT
#else
#define __restrict
#endif


void transform (float * __restrict dest, float * __restrict src, 
                float * __restrict matrix, int n) __attribute__ ((noinline));

void transform (float * __restrict dest, float * __restrict src, 
                float * __restrict matrix, int n)
{
  int i;

  // simple transform loop.

  // written with aliasing in mind. dest, src and matrix 
  // are potentially aliasing, so the compiler is forced to reload
  // the values of matrix and src for each iteration.

  for (i=0; i<n; i++)
  {
    dest[0] = src[0] * matrix[0] + src[1] * matrix[1] + 
              src[2] * matrix[2] + src[3] * matrix[3];

    dest[1] = src[0] * matrix[4] + src[1] * matrix[5] + 
              src[2] * matrix[6] + src[3] * matrix[7];

    dest[2] = src[0] * matrix[8] + src[1] * matrix[9] + 
              src[2] * matrix[10] + src[3] * matrix[11];

    dest[3] = src[0] * matrix[12] + src[1] * matrix[13] + 
              src[2] * matrix[14] + src[3] * matrix[15];

    src  += 4;
    dest += 4;
  }
}

float srcdata[4*10000];
float dstdata[4*10000];

int main (int argc, char**args)
{
  int i,j;
  float matrix[16];

  // init all source-data, so we don't get NANs  
  for (i=0; i<16; i++)   matrix[i] = 1;
  for (i=0; i<4*10000; i++) srcdata[i] = i;

  // do a bunch of tests for benchmarking. 
  for (j=0; j<10000; j++)
    transform (dstdata, srcdata, matrix, 10000);
}

Results (on my 2 GHz Core Duo):

nils@doofnase:~$ gcc -O3 test.c
nils@doofnase:~$ time ./a.out

real    0m2.517s
user    0m2.516s
sys     0m0.004s

nils@doofnase:~$ gcc -O3 -DUSE_RESTRICT test.c
nils@doofnase:~$ time ./a.out

real    0m2.034s
user    0m2.028s
sys     0m0.000s

Roughly 20% faster execution on that system.

To show how much it depends on the architecture, I ran the same code on a Cortex-A8 embedded CPU (I adjusted the loop count a bit because I didn't want to wait that long):

root@beagleboard:~# gcc -O3 -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp test.c
root@beagleboard:~# time ./a.out

real    0m 7.64s
user    0m 7.62s
sys     0m 0.00s

root@beagleboard:~# gcc -O3 -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -DUSE_RESTRICT test.c 
root@beagleboard:~# time ./a.out

real    0m 7.00s
user    0m 6.98s
sys     0m 0.00s

Here the difference is just 9% (same compiler, by the way).

Nils Pipenbrinck
  • Nice work. There is an article on the use of restrict on a Cell processor here: http://cellperformance.beyond3d.com/articles/2006/05/demystifying-the-restrict-keyword.html that may be relevant to the discussion of architecture-specific benefits. – Clifford Dec 27 '09 at 20:31
  • @Nils Pipenbrinck: Why do you have to disable inlining for the function? It seems like an awfully big function for the compiler to automatically inline. – Robert S. Barnes Dec 28 '09 at 07:40
  • 2
    @Nils Pipenbrinck: By the way Ulrich Drepper posted code for a superoptimized matrix multiply as part of his discussion of optimizing cache and memory usage. It's here: http://lwn.net/Articles/258188/ . His discussion of each step he went through to arrive at that solution is here: http://lwn.net/Articles/255364/ . He was able to reduce the execution time by 90% over a standard MM. – Robert S. Barnes Dec 28 '09 at 07:45
  • @Nils Pipenbrinck: I ran your test. When I compiled with -O3 I got the same result as you, about a 20% speed increase. But when I compiled without any optimization flag the two runs were identical. This is interesting: no opt flag = 0m5.022s for both, -O3 0m3.186s & 0m2.583s, -Os 0m2.391s & 0m2.314s. Optimizing for size gives the best result. Adding restrict in this case only bought an extra 3.2% performance. Wonder why that is? – Robert S. Barnes Dec 28 '09 at 07:58

Does the restrict keyword provide significant benefits in gcc / g++?

It can reduce the number of instructions, as shown in the example below, so use it whenever possible.

GCC 4.8, Linux x86-64 example

Input:

void f(int *a, int *b, int *x) {
  *a += *x;
  *b += *x;
}

void fr(int *restrict a, int *restrict b, int *restrict x) {
  *a += *x;
  *b += *x;
}

Compile and disassemble:

gcc -g -std=c99 -O0 -c main.c
objdump -S main.o

With -O0, they are the same.

With -O3 (same commands, with -O0 replaced by -O3):

void f(int *a, int *b, int *x) {
    *a += *x;
   0:   8b 02                   mov    (%rdx),%eax
   2:   01 07                   add    %eax,(%rdi)
    *b += *x;
   4:   8b 02                   mov    (%rdx),%eax
   6:   01 06                   add    %eax,(%rsi)  

void fr(int *restrict a, int *restrict b, int *restrict x) {
    *a += *x;
  10:   8b 02                   mov    (%rdx),%eax
  12:   01 07                   add    %eax,(%rdi)
    *b += *x;
  14:   01 06                   add    %eax,(%rsi) 

For the uninitiated, the x86-64 System V calling convention is:

  • rdi = first parameter
  • rsi = second parameter
  • rdx = third parameter

Conclusion: 3 instructions instead of 4.

Of course, instructions can have different latencies, but this gives a good idea.

Why was GCC able to optimize that?

The code above was taken from the Wikipedia restrict example, which is very illuminating.

Pseudo assembly for f:

load R1 ← *x    ; Load the value of x pointer
load R2 ← *a    ; Load the value of a pointer
add R2 += R1    ; Perform Addition
set R2 → *a     ; Update the value of a pointer
; Similarly for b. Note that x is loaded twice,
; because x may point to a (a aliased by x), so
; the value of *x may change when the value of *a
; changes.
load R1 ← *x
load R2 ← *b
add R2 += R1
set R2 → *b

For fr:

load R1 ← *x
load R2 ← *a
add R2 += R1
set R2 → *a
; Note that x is not reloaded,
; because the compiler knows it is unchanged
; "load R1 ← *x" is no longer needed.
load R2 ← *b
add R2 += R1
set R2 → *b

Is it really any faster?

Ermmm... not for this simple test:

.text
    .global _start
    _start:
        mov $0x10000000, %rbx
        mov $x, %rdx
        mov $x, %rdi
        mov $x, %rsi
    loop:
        # START of interesting block
        mov (%rdx),%eax
        add %eax,(%rdi)
        mov (%rdx),%eax # Comment out this line.
        add %eax,(%rsi)
        # END ------------------------
        dec %rbx
        cmp $0, %rbx
        jnz loop
        mov $60, %rax
        mov $0, %rdi
        syscall
.data
    x:
        .int 0

And then:

as -o a.o a.S && ld a.o && time ./a.out

on Ubuntu 14.04 AMD64, Intel i5-3210M CPU.
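
If you want to look beyond wall-clock time (a sketch, assuming Linux perf is available; the event names are the generic ones), counting retired instructions and cycles for the two variants makes the comparison more direct:

as -o a.o a.S && ld a.o
perf stat -e instructions,cycles ./a.out

If the extra mov is essentially free, the variant that keeps it should retire more instructions but take roughly the same number of cycles.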

I confess that I still don't understand modern CPUs. Let me know if you:

  • found a flaw in my method
  • found an assembler test case where it becomes much faster
  • understand why there wasn't a difference
Ciro Santilli OurBigBook.com

The article Demystifying The Restrict Keyword refers to the paper Why Programmer-specified Aliasing is a Bad Idea (pdf), which says it generally doesn't help and provides measurements to back this up.

George
  • There are many kinds of code where it provides little benefit, but there are some where it provides a huge benefit. Are you aware of any papers that would show that judicious use of "restrict" would not offer benefits greater than those compilers can realize through type-based aliasing? – supercat Mar 09 '16 at 00:28

Note that C++ compilers that allow the restrict keyword may still ignore it. That is the case, for example, here.
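
One way to see what your particular compiler actually does with it (a sketch along the lines of the two-pointer example in another answer; the file name and flags are only an illustration) is to compare the assembly generated with and without the keyword:

// test.cpp -- compile with: g++ -O3 -S test.cpp   and inspect test.s
void f(int *a, int *b, int *x) {
    *a += *x;
    *b += *x;   // *x must be reloaded here: a may alias x
}

void fr(int *__restrict__ a, int *__restrict__ b, int *__restrict__ x) {
    *a += *x;
    *b += *x;   // if __restrict__ is honoured, no reload of *x is needed
}

If both functions come out identical, the keyword is being accepted but ignored.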

Clifford
  • Actually, if you read down the page you'll notice that while restrict is ignored in C++ because of a potential conflict with user variables of the same name, `__restrict__` is supported for C++. – Robert S. Barnes Dec 27 '09 at 09:59
  • 1
    @Robert: And ignored. The difference is only that identifiers with a double underscore are reserved for system usage. Thus a \_\_restrict\_\_ should not clash with any user declared identifiers. – Martin York Dec 27 '09 at 20:02
  • @Martin: How do you know it's ignored? It's not completely clear from the documentation - seems like you could read it either way. – Robert S. Barnes Dec 28 '09 at 07:35
  • 2
    I agree that it is not clear, but it would seem inconsistent to ignore `restrict` but not `__restrict__`. Either way, it remains non-portable, and beneficial in very specific cases. I suggest that if you know it is beneficial in a particular situation, and you need that benefit to achieve success, then use it; otherwise why make the code gratuitously non-portable? I would not use it habitually, but as a last resort and after testing the actual benefit. – Clifford Dec 28 '09 at 09:32
  • @Clifford: Of course, but it's like that with pretty much any optimization - test this way and that way and then use what works. – Robert S. Barnes Dec 28 '09 at 11:17

I tested this C program. Without restrict it took 12.640 seconds to complete, with restrict 12.516 seconds. Looks like it can save some time.

raphaelr
  • That difference is almost certainly insignificant. However, you should also declare c as restrict, since each time c is written to, the compiler currently has to assume that *a, *b and *inc might have changed. – James Dec 27 '09 at 16:17
  • In your example the optimizer can detect that the parameters don't have aliasing. Try to disable inlining and you'll see a bigger difference. – Nils Pipenbrinck Dec 27 '09 at 18:41
  • But if you disable inlining, you're artificially crippling the compiler, so you no longer get an accurate picture of how much `restrict` helps on real-world code. – jalf Dec 28 '09 at 00:00
  • 2
    @raphaelr: It seems like you need to use optimization flags for restrict to be useful. Try either -O3 or -Os. – Robert S. Barnes Dec 28 '09 at 08:11
  • @DrewHall — A difference of 1/8 second may indeed be noise, but it is not necessarily noise. The only way to make that claim with certainty is to know for sure that the measurement was carried out poorly, for example a single run on a loaded system. But for all we know, *raphaelr* averaged the runtimes over a dozen runs, and the 1% difference is valid. Unfortunately, the link he gave is dead now, so there's no way to run an independent test of his code. – Todd Lehman Aug 11 '14 at 00:09
  • Dead link, and no info on what compiler or target architecture this was for, so we can't see if there was a code-gen difference or not. – Peter Cordes Mar 13 '19 at 05:49