Writing a C macro to use inside a CUDA kernel

Question

I have the following structure in my code, and it is used many times. To improve readability and reduce the number of lines, I would like to replace it with a macro. The part I want to write a macro for is as follows:

#define _UNROLL_FACTOR_volIntGrad 32
    int jj = 0;
    for (; jj < (ngbSize - 32); jj += 32) {
        int j = offset + jj;
#pragma unroll
        for (int k = 0; k < 32; k++){
        ...
        arbitrary calculation 1 (depends on k)
        ...
        }
        ...
        arbitrary calculation 2
        ...
    }

    for (; jj < (ngbSize - (_UNROLL_FACTOR_volIntGrad / 2)); jj+= (_UNROLL_FACTOR_volIntGrad / 2)){
        int j = offset + jj;
#pragma unroll
        for (int k = 0; k < 16; k++){
        ...
        arbitrary calculation 1 (depends on k)
        ...
        }
        ...
        arbitrary calculation 2
        ...
    }

    for (; jj < (ngbSize - (_UNROLL_FACTOR_volIntGrad / 4)); jj+= (_UNROLL_FACTOR_volIntGrad / 4)){
        int j = offset + jj;
#pragma unroll
        for (int k = 0; k < 8; k++){
        ...
        arbitrary calculation 1 (depends on k)
        ...
        }
        ...
        arbitrary calculation 2
        ...
    }

    for (; jj < (ngbSize - (_UNROLL_FACTOR_volIntGrad / 8)); jj+= (_UNROLL_FACTOR_volIntGrad / 8)){
        int j = offset + jj;

#pragma unroll
        for (int k = 0; k < 4; k++){
        ...
        arbitrary calculation 1 (depends on k)
        ...
        }
        ...
        arbitrary calculation 2
        ...
    }

    for (; jj < (ngbSize - (_UNROLL_FACTOR_volIntGrad / 16)); jj+= (_UNROLL_FACTOR_volIntGrad / 16)){
        int j = offset + jj;
#pragma unroll
        for (int k = 0; k < 2; k++){
        ...
        arbitrary calculation 1 (depends on k)
        ...
        }
        ...
        arbitrary calculation 2
        ...
    }
    for (; jj < ngbSize; jj++){
        int j = offset + jj;
        ...
        arbitrary calculation 3
        ...
    }
}

By "arbitrary calculation X", I mean a set of calculations that is independent of the macro and differs from function to function. Does anyone know how to write such a macro to shrink the structure above? For example, something like the following:

__MACRO
     arbitrary calculation 1 
     arbitrary calculation 2
     arbitrary calculation 3
__END

I presume you've done a fair bit of testing to establish that manually unrolling (and the corresponding increase in complexity and decrease in legibility) causes a statistically significant and worthwhile performance benefit? – EOF, Aug 05 '16 at 14:56
@EOF Yes, exactly right! This is part of a GPU kernel, and I really need to make the loop size known to the compiler so that it can unroll the loop and improve performance. But legibility decreases :-(. – Siamak, Aug 05 '16 at 14:59
Sounds like an XY problem. Never use a macro if you can use other measures. The problem is, you tagged two **different** languages, C and C++. Both provide different alternatives. Pick the language you actually use and you might get a better alternative. (You have to clarify your question, too, though.) – too honest for this site, Aug 05 '16 at 14:59
@Siamak So is this actually opencl or cuda or compute-shaders or anything like that? Otherwise, modern C-compilers will generally unroll automatically where applicable, and your manual unrolling can interfere with that. – EOF, Aug 05 '16 at 15:07
@EOF Actually it is CUDA. The problem is not the manual unrolling. `ngbSize` is unknown, so I break it into smaller parts with known loop sizes so that the compiler can unroll them (at least the smaller parts). Please look at the code and you will see the process. – Siamak, Aug 05 '16 at 15:10
@Siamak: I don't think your current approach of multiple differently unrolled loops is reasonable. I'd try to pad the blocks to a common size and unroll for that size only. – EOF, Aug 05 '16 at 15:15
@EOF As `ngbSize` is not known (let's assume something between 30 and 50), I need to start with a big value such as `32` and decrease it step by step to make the non-unrolled part as small as possible. – Siamak, Aug 05 '16 at 15:21
@Siamak: Or you pad it to a multiple of 32. You only need one unrolled loop, which is simple enough to write that you don't need a macro. It's probably considerably faster too. – EOF, Aug 05 '16 at 15:25
Let's say `ngbSize` is 32. `jj` is 0 and the code starts with `for (; jj < (ngbSize - 32); jj += 32)`. Is `0 < (ngbSize - 32)`? – kfsone, Aug 05 '16 at 17:53
@kfsone In this case, it does not go inside the loop and jumps to the next loop (the one with a step of 16). – Siamak, Aug 06 '16 at 11:31
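
To make the point about the loop bound concrete, here is a minimal host-side sketch in plain C++ (no CUDA needed to try it). It only replays the `jj` bookkeeping of the question's loops and reports how many elements fall through to the final one-at-a-time loop. The function name `scalarTailLength` and the `inclusive` switch are made up for this illustration; they are not part of the question's code.

```cpp
#include <cassert>

// Hypothetical helper, not from the question: replays the tiered loop
// bookkeeping and returns how many elements are left for the scalar tail.
// inclusive == false uses the question's bound:  jj <  ngbSize - step
// inclusive == true  uses the alternative bound: jj <= ngbSize - step
int scalarTailLength(int ngbSize, bool inclusive) {
    assert(ngbSize >= 0);
    int jj = 0;
    // Unroll widths 32, 16, 8, 4, 2, in the same order as the question.
    for (int step = 32; step >= 2; step /= 2) {
        while (inclusive ? (jj <= ngbSize - step) : (jj < ngbSize - step)) {
            jj += step;  // one fully unrolled chunk of `step` elements
        }
    }
    return ngbSize - jj;  // elements processed one at a time
}
```

With the question's `<` bound, `scalarTailLength(32, false)` is 2: the 32-wide loop is skipped, the 16, 8, 4 and 2 tiers each run once, and two elements reach the scalar loop. With `<=`, `scalarTailLength(32, true)` is 0, because a single 32-wide chunk covers everything. Both variants visit every element, so this is a question of efficiency rather than correctness, assuming calculation 3 is the per-element equivalent of calculations 1 and 2.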

score 2 · Answer 1 · edited May 23 '17 at 12:19

In C++, it is mostly frowned upon to use macros for anything other than include guards and conditional compilation for platform dependencies. The best thing to do would be to create a static constant that will be internally linked and has a single point of maintenance. You can put this at the top of your file.

If you're using C++11, then you can use a constexpr for what you are trying to do. The compiler will know your statement has a type rather than just a text replacement, which is essentially what C-style macros do.

The purpose of constexpr is to declare an immutable object whose value can be computed at compile time, much like a static constant. The great thing is that functions can be declared constexpr as well, which would be useful in your case, where some calculations depend on others.
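
As a rough illustration of the two paragraphs above, here is a small sketch of a typed constant plus a constexpr function. The names `kUnrollFactor` and `tierWidth` are invented for this example; only the value 32 and the halving pattern come from the question. Under nvcc, the function would likely also need a `__host__ __device__` annotation to be callable from a kernel; that part is an assumption and is left out so the snippet compiles as plain C++11.

```cpp
// Assumed names: kUnrollFactor and tierWidth are illustrative only.
// Typed, scoped replacement for `#define _UNROLL_FACTOR_volIntGrad 32`.
constexpr int kUnrollFactor = 32;

// Width of the n-th unrolled tier: 32, 16, 8, 4, 2 for tier = 0..4.
// A single return statement keeps this valid C++11 constexpr.
constexpr int tierWidth(int tier) {
    return kUnrollFactor >> tier;
}

// Evaluated by the compiler, so a wrong factor fails the build, not the run.
static_assert(tierWidth(0) == 32, "first tier uses the full unroll factor");
static_assert(tierWidth(4) == 2, "last unrolled tier handles pairs");
```

Because `tierWidth(1)` is a constant expression, it can be used wherever the question writes `_UNROLL_FACTOR_volIntGrad / 2`, including as a loop bound the compiler can see at compile time.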

See the usage for constexpr here: When should you use constexpr capability in C++11?

answered Aug 05 '16 at 15:00

Jessica Aboukasm

You might want to explain how `constexpr` is supposed to help with the OP's particular problem, otherwise this doesn't look like it answers the question. – EOF Aug 05 '16 at 15:03
@JessicaAboukasm Unfortunately the OP has changed his mind, this is a C (in fact, CUDA) question :( – kfsone Aug 05 '16 at 17:55
@kfsone the CUDA `nvcc` compiler follows C++ rules and claims C++ compliance. – Robert Crovella Aug 05 '16 at 20:49
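
Since nvcc accepts C++, one macro-free way to get the structure the question asks for is a function template that takes the unroll width as a template parameter and the three "arbitrary calculations" as callables. The sketch below is host-side C++ so it can be tried with any compiler. The names `runTier`, `tieredLoop` and `sumNeighbours` are made up for this illustration, and the summation is only a stand-in for the real calculations. In a kernel, the two templates would presumably be marked `__device__`, with `#pragma unroll` placed on the inner loop.

```cpp
#include <cassert>

// Hypothetical helper: one unrolled tier of compile-time width N.
// Keeps the question's loop bound (jj < ngbSize - N).
template <int N, typename Inner, typename After>
inline void runTier(int& jj, int ngbSize, int offset, Inner inner, After after) {
    for (; jj < ngbSize - N; jj += N) {
        const int j = offset + jj;
        // Under nvcc, `#pragma unroll` would go here; N is a constant.
        for (int k = 0; k < N; ++k) {
            inner(j, k);  // "arbitrary calculation 1" (depends on k)
        }
        after(j);         // "arbitrary calculation 2"
    }
}

// Hypothetical driver: the whole 32/16/8/4/2/1 cascade from the question.
template <typename Inner, typename After, typename Tail>
inline void tieredLoop(int ngbSize, int offset, Inner inner, After after, Tail tail) {
    int jj = 0;
    runTier<32>(jj, ngbSize, offset, inner, after);
    runTier<16>(jj, ngbSize, offset, inner, after);
    runTier<8>(jj, ngbSize, offset, inner, after);
    runTier<4>(jj, ngbSize, offset, inner, after);
    runTier<2>(jj, ngbSize, offset, inner, after);
    for (; jj < ngbSize; ++jj) {
        tail(offset + jj);  // "arbitrary calculation 3"
    }
}

// Stand-in use: sums data[offset] .. data[offset + ngbSize - 1].
inline int sumNeighbours(const int* data, int offset, int ngbSize) {
    assert(ngbSize >= 0);
    int acc = 0;
    tieredLoop(
        ngbSize, offset,
        [&](int j, int k) { acc += data[j + k]; },  // calculation 1
        [](int) {},                                 // calculation 2 (empty here)
        [&](int j) { acc += data[j]; });            // calculation 3
    return acc;
}
```

Each function that needs the pattern then shrinks to one `tieredLoop` call with three lambdas, which is close to the `__MACRO ... __END` shape from the question while staying type-checked and debuggable.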
