-2

Is there any way I can accelerate this function:

void task(int I, int J, int K, int *L, int **ids, double *bar){ 
    double *foo[K];
    for (int k=0;k<K;k++)
        foo[k] = new double[I*L[k]];
    // I am filling these arrays somehow.
    // This is not a bottleneck, hence omitted here.
    for (int i=0;i<I;i++)
        for (int j=0;j<J;j++){
            double tmp = 1.;
            for (int k=0;k<K;k++)
                tmp *= foo[k][i*L[k]+ids[j][k]]; //ids[j][k]<L[k]
            bar[i*J+j] = tmp;
        }
}

Typical values are: I = 100,000, J = 10,000, K=3, L=[50,20,60].

I read that the __restrict__ keyword/extension could help, but I am not sure how to apply it here. For example, when I try to put it into the definition foo[k] = new double[...], I get `error: '__restrict__' qualifiers cannot be applied to 'double'`. Furthermore, I don't know whether I should, or how I could, declare ids and ids[j], 1 <= j <= J, as restricted.

As a note, in my actual code I execute many such tasks in parallel, in as many threads as my CPU has cores.

I am writing mostly C-compatible C++, so solutions in both languages are welcome.

Bananach
  • `restrict` ain't part of C++, nor is it well defined. See https://stackoverflow.com/questions/776283/what-does-the-restrict-keyword-mean-in-c for details. How about filling in a `std::vector` and returning by value? Then you don't even need `restrict`. – JVApen Jan 06 '19 at 09:10
  • No need of `restrict` here, you don't modify possible aliases. – Jarod42 Jan 06 '19 at 09:11
  • Do you *have* to use this ragged/nested array format, instead of a flat 1D array? There are probably speedups other than `__restrict`, but I'm not sure I understand what this code is doing. Probably just manually hoisting stuff like `ids[j]` out of the loop will help, and looping over `foo[k][ stuff(k) ]` looks pretty nasty for an inner loop. Can't you build a more useful data structure in your initial loop, outside the triple-nested loop, that gives you better memory access patterns? – Peter Cordes Jan 06 '19 at 09:12
  • `double *foo[K];` is not valid C++ as it uses the VLA extension. – Jarod42 Jan 06 '19 at 09:12
  • _@Bananach_ If you have working code and ask for improvement, you should better post that question at [SE Code Review](https://codereview.stackexchange.com/questions). Be sure that your question meets the policies set up there. – πάντα ῥεῖ Jan 06 '19 at 09:14
  • Also, you don't show how you tried to put `__restrict` into the declaration of `double *foo[K];`, just some error message. So this isn't a [mcve] of what you tried. But it looks like you're saying you tried to put it into an assignment, not a declaration. – Peter Cordes Jan 06 '19 at 09:17
  • @Jarod42 but `foo` and `bar` could point to the same memory (also, you're right about the VLA, didn't notice that when I boiled down the real code. just assume `K` is global) – Bananach Jan 06 '19 at 10:08
  • @PeterCordes the nested array format of `foo` is semi-necessary. I could force set all entries of `L` to the largest entry. But I felt that that wouldn't help with the nonlocal access pattern caused by the irregularity of the entries of `ids` – Bananach Jan 06 '19 at 10:09
  • @πάνταῥεῖ Just in case you were the close voter: Just because the question fits Code Review doesn't mean that it does not fit here, see https://stackoverflow.com/help/on-topic "We feel the best Stack Overflow questions have a bit of source code in them, but if your question generally covers a specific programming problem [...] then you’re in the right place to ask your question!" – Bananach Jan 06 '19 at 10:13
  • I wrote that before I notice that you were doing 2D indexing *inside* each `foo[k][ ... ]`. Anyway, the reason I couldn't get sane asm from your code is that your inner-most loop does `K++` (incrementing the array bound), not `k++`. No wonder it was a total mess that didn't actually FP multiply, and that the compiler could never prove any of the integers were non-negative and didn't need sign-extension. – Peter Cordes Jan 06 '19 at 10:16
  • @Bananach Well, if you're posting here, you should at least provide a [mcve] reproducing your problem (which isn't a strict requirement at SE Code Review). And yes, the close vote is mine. – πάντα ῥεῖ Jan 06 '19 at 10:17
  • @πάνταῥεῖ It isn't a strict requirement here either, as per the help center. It is only for debugging questions. The main question here is "how do I accelerate this code". And the code without any `restrict` IS the MCVE from where to start. Well, I guess I should have added a main, and if that's what you find lacking I will be happy to add one. – Bananach Jan 06 '19 at 10:22
  • Three billion iterations is always going to take a while, and your scattered memory accesses aren't exactly cache-friendly. How long does it take, and how much time are you expecting to save by optimising? – molbdnilo Jan 06 '19 at 10:22
  • That array index expression hurts my eyes. But it does look bad, always be sure to address an array in storage order to maximize the odds that the processor caches are used efficiently. So k must be incremented in the outer-most loop. Doing it inner-most is horribly expensive. – Hans Passant Jan 06 '19 at 10:23
  • @molbdnilo how long it takes depends on what machine I'm running it on. I hope to save as much as possible. I normally write in Python and thus have no idea what I can expect or what I could do to write high-performance code – Bananach Jan 06 '19 at 10:28
  • `foo` and `bar` cannot point to the same address, one is local, whereas the other is given in argument... – Jarod42 Jan 06 '19 at 16:58

2 Answers

2

https://en.cppreference.com/w/c/language/restrict claims you can declare an array of restrict pointers to double like so in C99/C11:

typedef double *array_t[10];
restrict array_t foo;        // the type of foo is double *restrict[10]

But only gcc accepts that. I think this is a GCC-ism, not valid ISO C11. (gcc also accepts `array_t restrict foo_r;`, but no other compilers accept that either.)

ICC warns "restrict" is not allowed, clang rejects it with

<source>:16:5: error: restrict requires a pointer or reference ('array_t' (aka 'double *[10]') is invalid)
    restrict array_t foo_r;
    ^

MSVC rejects it with error C2219: syntax error: type qualifier must be after '*'

We get essentially the same behaviour in C++ from these compilers with __restrict, which they accept as a C++ extension with the same semantics as C99 restrict.


As a workaround, you can instead use a qualified temporary pointer every time you read from foo, instead of indexing foo[k][stuff] directly. I think this promises that the memory you reference through fk isn't the same memory you access through any other pointers within the block where fk is declared.

double *__restrict fk = foo[k];
tmp *= fk[ stuff ];

I don't know how to promise the compiler that none of the foo[0..K-1] pointers alias each other. I don't think this accomplishes that.
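
For completeness, here is a compilable sketch of that workaround applied to the question's loop nest. The name task_restrict and the choice to pass foo in as a parameter (rather than allocating it with new) are mine, so the sketch is self-contained; __restrict is a vendor extension, not ISO C++:

```cpp
#include <cstddef>

// Hedged sketch: the question's loop nest, with each foo[k] hoisted into a
// __restrict-qualified local inside the k loop. Within that block, fk
// promises not to alias memory reached through other pointers.
void task_restrict(int I, int J, int K, const int *L,
                   const int *const *ids, double *const *foo,
                   double *__restrict bar) {
    for (int i = 0; i < I; i++)
        for (int j = 0; j < J; j++) {
            double tmp = 1.;
            for (int k = 0; k < K; k++) {
                // Hoisted, qualified temporary pointer for this k.
                const double *__restrict fk = foo[k];
                tmp *= fk[i * L[k] + ids[j][k]];
            }
            bar[(std::size_t)i * J + j] = tmp;
        }
}
```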


You don't need __restrict here.

I added __restrict to all the pointer declarations, like int *__restrict *__restrict ids, and it doesn't change the asm at all, according to a diff pane on the Godbolt compiler explorer: https://godbolt.org/z/4YjlDA. That's what we'd expect, because type-based aliasing lets the compiler assume that a double store into bar[] doesn't modify any of the int * elements of int *ids[]. As people said in comments, there's no aliasing here that the compiler can't already sort out, and in practice it appears that it does sort it out, without any extra reloads of pointers.

It also can't alias *foo[k], because we got those pointers with new inside this function. They can't be pointing inside bar[].

(All the major x86 C++ compilers (GCC,clang,ICC,MSVC) support __restrict in C++ with the same behaviour as C99 restrict: a promise to the compiler that stores through this pointer don't modify objects that are pointed to by another pointer. I'd recommend __restrict over __restrict__, at least if you mostly want portability across x86 compilers. I'm not sure about outside of that.)

It looks like you're saying you tried to put __restrict__ into an assignment, not a declaration. That won't work; it's the pointer variable itself that __restrict applies to, not a single assignment to it.


The first version of the question had a bug in the inner loop: it had K++ instead of k++, so it was pure undefined behaviour and the compilers got weird. The asm didn't make any sense (e.g. no FP multiply instruction, even when foo[] was a function arg). This is why it's a good idea to use a name like klen instead of K for an array dimension.

After fixing that on the Godbolt link, there's still no difference in the asm with / without __restrict on everything, but it's a lot more sane.

BTW, making double *foo[] a function arg would let us look at the asm for just the main loop. And you would actually need __restrict because a store to bar[] could modify an element of foo[][]. This doesn't happen in your function because the compiler knows that new memory isn't pointed-to by any existing pointers, but it wouldn't know that if foo was a function arg.


A small amount of the work inside the loop is sign-extending 32-bit int results before using them as array indices with 64-bit pointers. This adds a cycle of latency in there somewhere, but not in the loop-carried FP multiply dependency chain, so it may not matter. You can get rid of one instruction inside the inner loop on x86-64 by using size_t k=0; as the inner-most loop counter. L[] is a 32-bit array, so i*L[k] needs to be sign-extended inside the loop. Zero-extension from 32 to 64-bit happens for free on x86-64, so i * (unsigned)L[k] saves a movsx instruction in the pointer-chasing dep chain. Then the inner loop that gcc8.2 makes is all necessary work, required by your nasty data structures / layout. https://godbolt.org/z/bzVSZ7
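
To make those two tweaks concrete, here is a hedged sketch of just the per-(i,j) inner product (the helper name product_row is mine; only the integer types differ from the question's code, the math is identical):

```cpp
#include <cstddef>

// Hedged sketch: size_t loop counter and (unsigned)L[k], so the index
// arithmetic zero-extends for free on x86-64 instead of needing a movsx.
double product_row(std::size_t K, const int *L, const int *idrow,
                   double *const *foo, unsigned i) {
    double tmp = 1.;
    for (std::size_t k = 0; k < K; k++)  // 64-bit counter: no sign extension of k
        tmp *= foo[k][i * (unsigned)L[k] + (unsigned)idrow[k]];
    return tmp;
}
```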

I don't know whether that's going to make a difference or not. I think more likely the memory access pattern causing cache misses will be your bottleneck with real data.

It also can't auto-vectorize because the data isn't contiguous. You can't get contiguous source data from looping over j or i, though. At least i would be a simple stride without having to redo ids[j][k].

If you generate foo[k][...] and bar[...] transposed, so you index with foo[k][ i + I * ids[j][k] ], then you'd have contiguous memory in the source and destination for the inner loop over i, so you (or the compiler) could use SIMD multiplies.
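
A hedged sketch of what that transposed version could look like (the layout and the names foo_t / bar_t are assumptions: foo_t[k] holds L[k] contiguous rows of length I, element (l, i) at l*I + i, and bar_t is stored transposed as bar_t[j*I + i]):

```cpp
#include <cstddef>

// Hedged sketch of the transposed layout: the inner i-loop reads and writes
// contiguous memory, so the compiler can auto-vectorize the multiplies.
void task_transposed(std::size_t I, std::size_t J, std::size_t K,
                     const int *const *ids, double *const *foo_t,
                     double *__restrict bar_t) {
    for (std::size_t j = 0; j < J; j++) {
        for (std::size_t i = 0; i < I; i++)
            bar_t[j * I + i] = 1.;
        for (std::size_t k = 0; k < K; k++) {
            // One contiguous row of foo_t[k], selected once per (j, k).
            const double *row = foo_t[k] + (std::size_t)ids[j][k] * I;
            for (std::size_t i = 0; i < I; i++)  // contiguous, SIMD-friendly
                bar_t[j * I + i] *= row[i];
        }
    }
}
```

With both arrays transposed, bar_t[j*I + i] corresponds to the question's bar[i*J + j].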

Peter Cordes
  • I'm very sorry for the bug. I shouldn't have started editing the code once I pasted it here – Bananach Jan 06 '19 at 10:16
  • I did try `foo[k] = new __restrict__ double[I*L[k]];`. I get your point that `__restrict` should be part of the declaration instead of the definition, but the question is then: How do I declare an array of pointers, such that each entry has type `double *__restrict`? – Bananach Jan 06 '19 at 10:17
  • Thanks, I just understood why bar and foo are not in aliasing danger. – Bananach Jan 06 '19 at 10:57
  • Do I understand correctly that since indexing 64-bit pointers (e.g. `bar`) with 32-bit integers is expensive, I should make `i` and `j` 64-bit? Furthermore, if I understand you correctly, I should use `i*(unsigned)L[k]` instead of `i*L[k]` and `size_t k` instead of `int k`. But isn't `uint64_t L` and `uint64_t k` even better? – Bananach Jan 06 '19 at 11:20
  • @Bananach: oh, declaring a `double *__restrict foo[10];` is an interesting question. That syntax doesn't work, but it seems there's a way using a typedef that gcc accepts. https://godbolt.org/z/0JJJSv. But the other compilers reject it so it's probably a GCC-ism and not valid ISO C11. I might ask a new question about that, since this question is mostly asking whether `__restrict` would make any difference. Of course you can work around on any compiler by using `double *__restrict fk = foo[k];` every time you load from the array. – Peter Cordes Jan 06 '19 at 11:20
  • @Bananach: Sign-extending `i` and `j` happens once per outer loop, but yeah it might improve code-gen overall in this case. But no, `uint64_t L[]` wouldn't be any better, and would slightly increase your cache footprint and code size by requiring a REX prefix. And on Bulldozer-family and Silvermont, 64-bit `imul` is slower than 32-bit `imul` (https://agner.org/optimize/). With `i` in a register, it's always going to take a separate load instruction to get `i*L[k]` into a register without destroying `i`, so it wouldn't enable `imul r10, qword ptr [rbx + 8*rdi]` instead of `mov`+`imul`. – Peter Cordes Jan 06 '19 at 11:28
  • Even if x86 was a 3-operand ISA, with `imul dst, src1, src2` instead of `imul dst, src`, writing a 32-bit result to a register zero-extends it for free to 64-bit. So just using `unsigned` is all that's necessary for that case. But for loop counters, `unsigned` would be much worse, because unlike signed, `unsigned` overflow is well defined to wrap around, so the compiler can't make as many assumptions based on the program not having undefined behaviour. It's unfortunately hard to give any kind of general recommendation for using signed vs. unsigned integers. – Peter Cordes Jan 06 '19 at 11:33
  • I think the wasted instruction in gcc's code-gen that I eliminated with a `size_t` loop counter was actually just a missed optimization. `k < K` will definitely become false at some point for any possible `K`, so the loop can't be infinite. Yeah, clang doesn't seem to have that stupid `mov` + `lea` crap, so it's just hand-holding GCC being dumb, not anything fundamental about making the code easier to optimize. – Peter Cordes Jan 06 '19 at 11:35
  • Really appreciate your efforts here, will have to do some reading up about CPU instructions to really understand it though. I hope the important point is that there is no obvious way to get improvements of what I posted – Bananach Jan 06 '19 at 11:37
0

restrict does not matter in this case.

Your algorithm is rubbish: it does not allow long vector operations to be used, so micro-optimizations will not help here at all.

You need to find a way for the elements in the inner loop to occupy a consecutive block of array indexes. As it is done now, the compiler has to read every single element from a different position in the array, which bars it from loop unrolling and longer vector instructions. It is also very cache-unfriendly.

Rethink the algorithm first - premature micro-optimizations will not help if the algorithm is extremely inefficient.

Edit

After the OP's comment, I just want to show what the difference is between a "naive" approach and a more efficient one (less naive, but harder to understand).

Let's consider the parity of a 32-bit unsigned value. The naive approach:

int very_naive_parity(const uint32_t val)
{
    unsigned parity = 0;
    
    for(unsigned bit = 0; bit < 32; bit++)
    {
        if(val & (1U << bit))
        {
            parity = !parity;
        }
    }
    return parity;
}

It is very easy to write and understand, but it is extremely inefficient: at least 288 instructions will be executed to calculate this parity.

A more efficient version:

int parity(const uint32_t val)
{
    uint32_t tmp = val;
    
    tmp ^= tmp >> 16;
    tmp ^= tmp >> 8;
    tmp ^= tmp >> 4;
    return (0b110100110010110 >> (tmp & 0x0f)) & 1;
}

will be executed in 9 instructions (both counts exclude function prologues and epilogues). Is it harder to understand? Definitely yes. But as I wrote, efficiency usually means less easy for humans.

0___________
  • "Your algorithm is rubbish". Without having any suggestions of improving it, that's maybe a bit harsh. Do you know of any way to give the exact same results that `task` gives, for arbitrary inputs of `ids`? – Bananach Jan 06 '19 at 10:47
  • @Bananach I gave the direction in my answer. You can't write efficient code on top of a "naive", inefficient algorithm. As I state in my answer, you need to organize the arrays in a way that does not require reads from non-consecutive array elements. It may require rewriting large parts of your code. When performance is an important factor, you need an efficient algorithm. – 0___________ Jan 06 '19 at 11:30
  • @Bananach *How can you say the algorithm is rubbish?* - I see it. You need to rethink it. In your case the data has to be organized another way. Will it make the code more difficult to write? Yes, but that is usually the price of performance. **But it is your program** and you can write it any way you want. But if you ask for opinions, you need to be prepared to hear negative opinions about your work. – 0___________ Jan 06 '19 at 11:52
  • @Bananach: I wonder if sorting `ids` as you construct `foo[]` would be helpful. Or to somehow factor it out and apply that remapping before or after the `O(I*J*K)` multiply work, maybe on the way from a temporary array to `bar`. Although if `K=3` is typical, it's not nearly as bad as a matmul. I have to agree with the harsh assessment here: major restructuring could have huge performance gains, because this code inherently has huge bottlenecks that probably reduce throughput far below even maxing out scalar FP multiply. Maybe you can take advantage of something we don't know about `ids[]`. – Peter Cordes Jan 06 '19 at 11:58
  • @ P__J__ I deleted my comment before you wrote that second answer. My point was that if I, say, sorted `ids`, then did computations, and then unsorted, this would only change the *implementation*, so if it accelerated my code, it would be exactly the kind of thing I asked for, and it would not support the view that the *algorithm* is bad but only that the *implementation* is bad. I guess I just misunderstood what you mean by "algorithm". I'm doing this for a scientific computation and in that context, the algorithm behind this is well analyzed and has perfect scaling and complexity etc. – Bananach Jan 06 '19 at 12:07
  • @Bananach writing efficient code very often requires sacrificing something else. In your case you will probably need much larger tables with repeated data to achieve this. You need to see how your data is organized and how it is accessed, and then think about how to optimize it with performance in mind. I am not going to give you a ready solution. – 0___________ Jan 06 '19 at 12:17
  • @Bananach: if any decent amount of your data fits in cache, or if there's any kind of locality anywhere, repeating it might be worse than an extra level of indirection while scattering around in it. Worth experimenting both ways, but there's going to be a threshold at some point where huge but sequential for maximum brute force isn't actually better than more compact for more cache hits. Especially if you can get any decent amount of L2 cache hits (typically 256kiB), e.g. from cache-blocking optimizations. But IDK which side of the threshold your problem might come down on. – Peter Cordes Jan 06 '19 at 12:55