
I'm familiar with the usage of the __restrict keyword for performance optimization in C and specifically CUDA in this case.

void Foo(const float* __restrict X, const float* __restrict Y);

I understand that this Foo function has __restrict keywords which indicate to the compiler that X and Y are guaranteed to point to distinct blocks of memory.

How does alias restriction work when we have a pointer to a pointer?

void Bar1(const float* const * __restrict X, const float* const * __restrict Y);
void Bar2(const float* const __restrict * __restrict X, const float* const __restrict * __restrict Y);

Is Bar1 fully restricted or does each level of indirection need to be restricted as shown in Bar2?

Which syntax correctly indicates that all pointers can take advantage of read-only caching? Do I need to "restrict" both pointers or only the top level variable name?

Russell Trahan
  • All pointers point to const data. Is any float written in this code via a pointer? – tstanisl Mar 15 '21 at 23:14
  • The function is only used to access the read-only float data. My question is targeting how can I best suggest to the compiler to cache the intermediate pointers by using the restrict keyword. Do I need to "restrict" both pointers or only the top level variable name? – Russell Trahan Mar 15 '21 at 23:54
  • By [my read](https://en.cppreference.com/w/c/language/restrict) it should not be necessary to decorate anything other than a single (i.e. "top-level") pointer. To wit: "if some object that is accessible through P (directly or indirectly) is modified, by any means, then all accesses to that object (both reads and writes) in that block must occur through P (directly or indirectly), otherwise the behavior is undefined". Note that this is looking at reference material. I'm not making a statement about what any particular compiler behavior *is*, just what it *could be*. – Robert Crovella Mar 16 '21 at 00:00
  • Your question is tagged C, but CUDA claims compliance to C++, not C. Furthermore, CUDA provides [its own definition of restrict](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#restrict); however, I don't see any conflicts there with what I have said. In any event, `restrict` is simply a contract you make with the compiler. What exactly the compiler does when that contract is made may be implementation-specific, like a hint. Or perhaps someone will correct me if I am wrong about this last statement. – Robert Crovella Mar 16 '21 at 00:03
  • In fairness, CUDA's `__restrict__` was modeled on `restrict` in ISO-C99. To my knowledge, ISO-C++ has yet to define `restrict` (that's why it is `__restrict__` with underscores, putting it into the compiler's namespace). As for interaction of multiple levels of indirection with `__restrict__`, the last time I discussed this with a compiler engineer, my take-home message was this: Very confusing situation with nonobvious semantics that are unlikely to be what one thinks they are. So in terms of practical advice and as Robert Crovella suggests, best stick with decorating top-level pointers only. – njuffa Mar 16 '21 at 00:55
  • If the data are read-only then `restrict` is pointless. It is used to tell the compiler that specific data will not be affected by modification of other data. As there are no modifications, the `restrict` is redundant, isn't it? – tstanisl Mar 16 '21 at 09:09
  • Has anyone tested it? I just tested it here: https://godbolt.org/z/Wq3bvP Sure, it is not CUDA (I have no GPU available), but at least it shows that there is a compiler that produces the same output with just one `restrict` but a different one with two `restrict`s. Of course it is also a different function header than in the question, but again I just wanted to show one _may_ need both `restrict`s. So I would say, just test it for your specific compiler (version) and function combination and decide based on the results. – BlameTheBits Mar 16 '21 at 18:41
  • A [simple test](https://godbolt.org/z/zjxf4v) for the CUDA device compiler appears to generate the same code whether the top is decorated or both are decorated. – Robert Crovella Mar 17 '21 at 02:53
  • @RobertCrovella Oh, they have CUDA. Nice, didn't know that. The problem with your test is that the compiler also generates the same code with no `restrict` at all. I also always got the same output with my ported example. With the example from the [programming guide](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#restrict), however, it's different: it's actually [the same](https://godbolt.org/z/z7Y9j6) with one and two `restrict`s but different with none. – BlameTheBits Mar 17 '21 at 07:32
  • Definitely adding any restrict keywords in CUDA improved the performance for reading the data. Unfortunately, I can't post specifics online. Const alone wasn't enough to get NVCC to utilize the cache heavily. – Russell Trahan Mar 18 '21 at 02:51
  • @RobertCrovella Can you post something about your test? I'd be curious to see and mark that as the answer for now, unless something better comes along. – Russell Trahan Mar 18 '21 at 02:51
  • @RobertCrovella On second glance, removing the const and restrict keywords makes no difference in the instructions. Something may be missing here. – Russell Trahan Mar 18 '21 at 19:35
  • Yes, that was already pointed out in the comments. If you continue to read the comments, BlameTheBits offered an example that is claimed to fit the expectation. Without the decorator, you get a certain code configuration. With either arrangement of the decorator, you get a different config. I believe this is consistent with what I stated as an expectation. – Robert Crovella Mar 18 '21 at 19:41
  • https://godbolt.org/z/EbKMzM I don't know if that was the best example code because each pointer was only dereferenced once--caching doesn't really matter. Here's a modified snippet which shows the restrict makes a difference versus no restrict, but 2 restricts is the same as 1 restrict. – Russell Trahan Mar 18 '21 at 19:48
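
For readers who want to repeat this kind of comparison on their own compiler, here is a minimal host-side C sketch of the pattern discussed in the comments above: the inner pointers are re-read inside a loop that also stores to an output, so alias information can affect whether those loads get hoisted. The function names `sum_rows_outer` and `sum_rows_both` are hypothetical and are not taken from any of the linked Godbolt snippets. It assumes a compiler that accepts the `__restrict` spelling used in the question (GCC and Clang do). It makes no claim about what any compiler will emit; the point is only to have two variants whose assembly can be diffed.

```c
#include <assert.h>
#include <stddef.h>

/* Variant A: only the top-level pointers are restrict-qualified. */
void sum_rows_outer(const float* const* __restrict rows,
                    float* __restrict out,
                    size_t n_rows, size_t n_cols)
{
    assert(rows != NULL && out != NULL);
    for (size_t c = 0; c < n_cols; ++c) {
        out[c] = 0.0f;
        for (size_t r = 0; r < n_rows; ++r) {
            /* rows[r] is read again after each store to out[c]. */
            out[c] += rows[r][c];
        }
    }
}

/* Variant B: both levels of indirection are restrict-qualified,
   mirroring Bar2 from the question. Same body, different contract. */
void sum_rows_both(const float* const __restrict* __restrict rows,
                   float* __restrict out,
                   size_t n_rows, size_t n_cols)
{
    assert(rows != NULL && out != NULL);
    for (size_t c = 0; c < n_cols; ++c) {
        out[c] = 0.0f;
        for (size_t r = 0; r < n_rows; ++r) {
            out[c] += rows[r][c];
        }
    }
}
```

Both variants compute the same column sums when the inputs do not overlap the output, so any difference between them shows up only in the generated instructions, which is what the comments above compare.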

1 Answer


Do I need to "restrict" both pointers or only the top level variable name?

Restrict both "levels" of pointers.

Even if this is not necessary for enabling use of the non-coherent/read-only cache, it is still the right choice, because you are being more explicit in your description of the input parameters. You are making it clearer to the person using your function that they are expected not to have the inner pointers point to overlapping locations.
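
As an illustration of that caller-side expectation, here is a rough sketch in plain C of a debug check that the inner pointers do not overlap. The helper name `rows_are_disjoint` and the assumption that every row holds `n_cols` floats are mine, not part of the question's code. Note also that, per the C rules quoted in the comments, overlap only becomes undefined behavior when the shared data is actually modified, so for purely read-only data this check is documentation more than a safety requirement.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns true when no two rows, each assumed to
   hold n_cols floats, occupy overlapping byte ranges. */
static bool rows_are_disjoint(const float* const* rows,
                              size_t n_rows, size_t n_cols)
{
    assert(rows != NULL);
    const uintptr_t bytes = (uintptr_t)(n_cols * sizeof(float));
    for (size_t i = 0; i < n_rows; ++i) {
        const uintptr_t a = (uintptr_t)rows[i];
        for (size_t j = i + 1; j < n_rows; ++j) {
            const uintptr_t b = (uintptr_t)rows[j];
            /* Half-open ranges [a, a+bytes) and [b, b+bytes) overlap
               when each one starts before the other one ends. */
            if (a < b + bytes && b < a + bytes) {
                return false;
            }
        }
    }
    return true;
}
```

A caller could run something like `assert(rows_are_disjoint(X, n_rows, n_cols));` before invoking a function declared like `Bar2`, where `n_rows` and `n_cols` stand for whatever sizes the real code uses.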

einpoklum