
Knowing the number of iterations a loop will go through allows the compiler to perform some optimizations. Consider for instance the two loops below:

Unknown iteration count:

static void bitreverse(vbuf_desc * vbuf)
{
    unsigned int idx = 0;
    unsigned char * img = vbuf->usrptr;

    while(idx < vbuf->bytesused) {
        img[idx] = bitrev[img[idx]];
        idx++;
    }

}

Known iteration count:

static void bitreverse(vbuf_desc * vbuf)
{
    unsigned int idx = 0;
    unsigned char * img = vbuf->usrptr;

    while(idx < 1280*400) {
        img[idx] = bitrev[img[idx]];
        idx++;
    }

}

The second version will compile to faster code, because it will be unrolled twice (on ARM with gcc 4.6.3 and -O2, at least). Is there a way to make an assertion about the loop count that gcc will take into account when optimizing?

shodanex
  • BTW what do you mean by unrolled twice? – sr01853 Jan 10 '13 at 10:53
  • You can always pass `-funroll-all-loops` for each module where you know it helps. – Fred Foo Jan 10 '13 at 10:58
  • @Sibi Probably that the code in the loop body will be emitted twice, and the counter halved before entering the loop. This can buy better performance due to doing more work before dealing with the overhead of the loop itself. – unwind Jan 10 '13 at 10:58
  • @larsmans passing -funroll is telling the optimizer what to do. That is not what I want. I want to give it info and let the optimizer work. – shodanex Jan 10 '13 at 11:05
  • Is this just an example or actual code? If it's actual code, what is the target? On x86 with SSSE3 support, you can reverse the bits in 16 bytes at once with a nice `pshufb` trick, or with SSE5, there is `vpperm`. IIRC there is a fast bit-reversal instruction on ARM, too. – harold Jan 10 '13 at 16:06
  • @harold There is a 32 bit reversal instruction and a byte order reversal instruction on the target CPU, however I am interested in the more general cases where this might be useful, and the provided case is just an example. – shodanex Jan 11 '13 at 09:45

2 Answers


There is a hot attribute on functions to give the compiler a hint about hot spots: http://gcc.gnu.org/onlinedocs/gcc/Function-Attributes.html. Just add before your function:

static void bitreverse(vbuf_desc * vbuf) __attribute__ ((hot));

Here are the docs about 'hot' from gcc:

hot The hot attribute on a function is used to inform the compiler that the function is a hot spot of the compiled program. The function is optimized more aggressively and on many targets it is placed into a special subsection of the text section so all hot functions appear close together, improving locality. When profile feedback is available, via -fprofile-use, hot functions are automatically detected and this attribute is ignored.

The hot attribute on functions is not implemented in GCC versions earlier than 4.3.

The hot attribute on a label is used to inform the compiler that paths following the label are more likely than paths that are not so annotated. This attribute is used in cases where __builtin_expect cannot be used, for instance with computed goto or asm goto.

The hot attribute on labels is not implemented in GCC versions earlier than 4.8.

Also you can try to add __builtin_expect around your idx < vbuf->bytesused check - it will be a hint that in most cases the expression is true.
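
For example, a minimal sketch on the original loop (note this is only a likelihood hint: with gcc 4.x, __builtin_expect biases branch prediction and block layout, it does not convey the trip count):

while (__builtin_expect(idx < vbuf->bytesused, 1)) {
    img[idx] = bitrev[img[idx]];
    idx++;
}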

In both cases I'm not sure that your loop will be optimized.

Alternatively you can try profile-guided optimization. Build a profile-generating version of the program with -fprofile-generate; run it on the target, copy the profile data back to the build host, and rebuild with -fprofile-use. This will give the compiler a lot of information.
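
A rough sketch of that workflow (file and program names are invented for illustration; the flags are the standard gcc ones):

gcc -O2 -fprofile-generate -o bitreverse_prof main.c   # instrumented build
./bitreverse_prof sample_frame.raw                     # run on the target with representative input
# copy the generated .gcda files back to the build host, then:
gcc -O2 -fprofile-use -o bitreverse main.c             # rebuild using the recorded profile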

Some compilers (not GCC) have loop pragmas, including "#pragma loop count (N)" and "#pragma unroll (M)": e.g. the Intel compiler; unroll pragmas in IBM; vectorization pragmas in MSVC.
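
For instance, with the Intel compiler the hint could look roughly like this (Intel spells the pragma loop_count; the exact syntax should be checked against that compiler's documentation, and gcc does not understand these pragmas):

#pragma loop_count(512000)   /* expected trip count: 1280*400 */
#pragma unroll(2)
for (idx = 0; idx < vbuf->bytesused; idx++)
    img[idx] = bitrev[img[idx]];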

The ARM compiler (armcc) also has some loop pragmas, such as unroll(n):

Loop Unrolling: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0348b/CJACACFE.html and http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0348b/CJAHJDAB.html

and a __promise intrinsic:

Using __promise to improve vectorization

The __promise(expr) intrinsic is a promise to the compiler that a given expression is nonzero. This enables the compiler to improve vectorization by optimizing away code that, based on the promise you have made, is redundant. The disassembled output of Example 3.21 shows the difference that __promise makes, reducing the disassembly to a simple vectorized loop by the removal of a scalar fix-up loop.

Example 3.21. Using __promise(expr) to improve vectorization code

void f(int *x, int n)
{
    int i;
    __promise((n > 0) && ((n&7)==0));
    for (i=0; i<n;i++) x[i]++;
}
osgx
  • PS. Bingfeng Mei from broadcom has some patches to add `loop count` pragma to gcc, check his [message in gcc maillist](http://gcc.gnu.org/ml/gcc/2009-11/msg00454.html) – osgx Jan 10 '13 at 12:24

You can actually specify the exact count with __builtin_expect, like this:

while (idx < __builtin_expect(vbuf->bytesused, 1280*400)) {

This tells gcc that vbuf->bytesused is expected to be 1280*400 at runtime.

Alas, this does nothing for optimization with current gcc versions. Haven't tried with 4.8, though.

Edit: Just realized that every standard C compiler has a way to exactly specify the loop count, via assert. Since the assert

#include <assert.h>
...
assert(loop_count == 4096);
for (i = 0; i < loop_count; i++) ...

will call exit() or abort() if the condition is not true, any compiler with value propagation will know the exact value of loop_count. I always thought that this would be the most elegant and standard-conforming way to give such optimization hints. Now, I want a C compiler that actually uses this information.
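
If relying on the assert implementation seems fragile (it vanishes when NDEBUG is defined), a related sketch: gcc also provides __builtin_unreachable() (since gcc 4.5), which states the same fact to the optimizer directly:

if (loop_count != 4096)
    __builtin_unreachable();   /* promise to gcc: this branch can never be taken */
for (i = 0; i < loop_count; i++) ...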

Note that if you want to make this faster, bytewise unrolling might be less effective than using a wider lookup table. A 16-bit table would occupy 128K, and thus often fit into the CPU cache. If the data is not completely random, an even wider table (3 bytes) might be effective.

2-byte example:

unsigned short *bitrev2;
...
for (idx = 0; idx < vbuf->bytesused; idx += 2) {
    *(unsigned short *)(&img[idx]) = bitrev2[*(unsigned short *)(&img[idx])];
}

This is an optimization the compiler can't perform, regardless of the information you give it.
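
For completeness, a sketch of how such a 16-bit table could be filled from the existing per-byte bitrev table (assuming bitrev[b] holds the bit-reversed value of byte b; the helper name is made up, and the table is declared here as a static array rather than the pointer above). Because each byte is only replaced in place, this works the same on little- and big-endian targets:

static unsigned short bitrev2[65536];   /* 128K, as noted above */

static void init_bitrev2(void)
{
    unsigned int i;
    for (i = 0; i < 65536; i++)
        bitrev2[i] = (unsigned short)(bitrev[i & 0xff] | (bitrev[i >> 8] << 8));
}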

Chris
  • Or I could use the rbit + rev instructions to handle 4 bytes at a time. However I am not only interested in this particular function, but in the general case of communicating hints to gcc. Other compilers have this kind of hint. – shodanex Jan 11 '13 at 09:42
  • @shodanex As mentioned, gcc has the hint, too. It just does not use it for unrolling with this sample code. This does not rule out the possibility that it would do so for other code. Have you checked whether the other compilers actually USE their hint? Anyone here who uses Intel C? – Chris Jan 11 '13 at 09:59
  • No, it does not if the loop count is not specified, and it does use the info to unroll when it is specified. I will try the assert thing – shodanex Jan 13 '13 at 19:50
  • "Since the assert will call exit() or abort()" - compiler has no idea about particular "assert" implementation. In many cases it is macro thus compiler does not even see the symbol "assert" it sees something completely different (usually if () call). Also compiler have no idea what exit() or abort() functions mean. In particular bare to the metal code I can write my own "abort" and "exit" which do not do what compiler may think is logical. No way assert(something) gives compiler any useful hint unless assert will become a language feature like in Java. (Sorry for being late for the party). – Leo Jan 09 '15 at 22:14
  • @Leo any hosted standard C compiler may assume that the standard library functions have the standard semantics. gcc certainly does so, unless you disable it with -ffreestanding. – Chris Jan 12 '15 at 14:56
  • http://pubs.opengroup.org/onlinepubs/009695399/basedefs/assert.h.html says: "If NDEBUG is defined as a macro name before the inclusion of this header, the assert() macro shall be defined simply as: #define assert(ignore)((void) 0)". The compiler sees the result of preprocessing: "((void) 0)". Even assuming it is super smart and the assert semantics do not change, how on earth can it derive any useful hints from "((void) 0)"? `nough said. Peace. – Leo Jan 12 '15 at 19:42