
Did I use the prefetch instruction correctly to reduce memory latency?

Can I do better than this?

When I compile the code with -O3, g++ seems to unroll the inner loop (code at godbolt.org).

The architecture of the CPU is Broadwell.

Thanks.


The code below loops backward over an array, reading and writing its elements.

Each calculation depends on the previous calculation.

#include <stdlib.h>
#include <iostream>
int main() {
    const int N = 25000000;
    float* x = reinterpret_cast<float*>(
            aligned_alloc(16, 4*N)
    );  // 0.1 GB

    x[N - 1] = 1.0f;

    // fetch last cache line of the array
    __builtin_prefetch(&x[N - 16], 0, 3);

    // Backward loop over the i^th cache line.
    for (int i = N - 16; i >= 0; i -= 16) {
        for (int j = 15; j >= 1; --j) {
            x[i + j - 1] += x[i + j];
        }
        __builtin_prefetch(&x[i - 16], 0, 3);
        x[i - 1] = x[i];  // carry the running sum into the next (lower) cache line
    }

    std::cout << x[0] << "\n";
    free(x);
}
R zu
  • I think gcc is unrolling enough to avoid redundantly prefetching the same cache line multiple times. Neat. HW prefetch into L2 at least does work for descending order, but maybe not into L1d. And BTW, there are ways to SIMD-vectorize this prefix-sum, but the compiler isn't auto-vectorizing it for you. [parallel prefix (cumulative) sum with SSE](https://stackoverflow.com/q/19494114). Reversing the direction of memory access doesn't change the fundamental problem. – Peter Cordes Oct 03 '18 at 23:36
  • Anyway, no, you're not prefetching effectively. The prefetch distance should be quite a few cache lines ahead of where you're currently accessing. But the optimal prefetch distance depends on the hardware and maybe even hyperthreading for cache pressure from the other thread. – Peter Cordes Oct 03 '18 at 23:38
  • @PeterCordes Thanks! I am trying to learn cache blocking as you suggested at https://stackoverflow.com/questions/51933660/xeon-cpu-e5-2603-backward-memory-prefetch – R zu Oct 04 '18 at 00:32
  • @PeterCordes `UOPS_ISSUED.STALL_CYCLES` seems to be very high for this code. I'm not sure why. `RESOURCE_STALLS.RS` is also a little high due to port contention. The prefetching instruction actually harms performance irrespective of the distance. – Hadi Brais Oct 04 '18 at 09:45
  • @HadiBrais: It bottlenecks on the loop-carried `addss` dep chain, not on front-end throughput, so it's not surprising there are huge amounts of stall cycles with no uops issued. The only ILP is the loop overhead and the stores. – Peter Cordes Oct 04 '18 at 10:39
  • @PeterCordes Yes, but then I'd expect it to run in the worst case at about 32 cycles per iteration, but it's actually 5 times worse than that on Haswell/Broadwell. And it's not because of `RESOURCE_STALLS.RS` alone. IACA even says it should run at 18.4 cycles per iteration. Since most loads actually hit in the L1D, I'd expect IACA to be close, but it's far off. – Hadi Brais Oct 04 '18 at 17:07
  • @HadiBrais: hmm, aligned_alloc doesn't zero memory, but in practice an allocation that large should be fresh zeroed pages from the OS. So we shouldn't be getting FP assists for subnormals. – Peter Cordes Oct 04 '18 at 17:11
  • @HadiBrais: I tried it with the OP's code unmodified. It only takes 63ms, kinda too short for a good benchmark, and it probably spends significant time on page faults. `cycles:u` was 120M, `cycles` was 236M. page faults was ~24.5k. IPC=0.69, UOPS_ISSUED.STALL_CYCLES was 174M while UOPS_ISSUED.STALL_CYCLES:u was ~103M. (With of ~210M `uops_issued.any`.) This is on Skylake (4 cycle `vaddsd` latency), without any care taken to warm up the CPU frequency or whatever, just a quick test. I compiled with `g++ -O3 -march=skylake`. Transparent hugepage defrag = `[defer+madvise]` – Peter Cordes Oct 04 '18 at 17:19
  • How do you guys get these numbers? I mean the `UOPS_ISSUED.STALL_CYCLES` etc. I am thinking about storing two copies of input data. One in forward and another in reversed arrangement. Then this becomes just backward writes instead of backward read/write. Would it help if I do a lot more calculations in the loop? – R zu Oct 04 '18 at 17:27
  • @PeterCordes On Haswell, cycles:u=105M, UOPS_ISSUED.STALL_CYCLES:u=218M (which doesn't seem to be counting cycles as it claims to be), RESOURCE_STALLS.RS:u=61M, and IPC=0.52. This means it's executing at 67.2 cycles per iteration or about 1.86 cycles per instruction in the loop. Although the L1D hit rate is >99%. It appears there is something more than just port contention. I'm surprised that IACA got this so wrong. – Hadi Brais Oct 04 '18 at 18:43
  • @Rzu We use a tool called `perf` which uses hardware performance counters. Doing more calculations in the loop may help if these calculations use underutilized ports. – Hadi Brais Oct 04 '18 at 18:44
  • @PeterCordes The numbers I gave in my previous comment are for the case without the prefetch instruction in the loop. If I add the instruction, then cycles:u=115M, UOPS_ISSUED.STALL_CYCLES:u=229M, RESOURCE_STALLS.RS:u=71M, and IPC=0.5. – Hadi Brais Oct 04 '18 at 18:51
  • @Rzu The mere existence of the prefetch instruction makes your code slower. – Hadi Brais Oct 04 '18 at 18:52
  • @Rzu: You almost certainly don't want 2 copies of the same data using up cache footprint / memory bandwidth. – Peter Cordes Oct 04 '18 at 19:00
  • The L1 IP-based prefetcher can prefetch backwards and it only tracks load instructions. I think since all the loads in the same iteration access a single cache line, the IP-prefetcher tracks only one of them and detects that the stride is one (next line) so tries to prefetch the next few lines. The L2 can prefetch lines in the E coherence state, so the L1 receives them in the appropriate state for the stores as well. So there is really no need for software prefetching. – Hadi Brais Oct 04 '18 at 19:06
  • @PeterCordes It will be fine in my case. The computer has lots of memory. The program does not change the input data, which is all loaded into memory. I store a forward-ordered copy of the input for one long forward loop with lots of calculation, and a backward-ordered copy for one long backward loop with lots of calculation. The whole program performs these two loops many times. – R zu Oct 04 '18 at 19:31
  • The memory footprint itself is unimportant, the important part is the *cache* footprint. You want to work on small enough chunks that you get L2 cache hits, or maybe only L3 if you can't easily work in 128kiB to 256k chunks. (L2 is 1MiB in Skylake-X, though, so blocking for L2 is easier but also more important there). You said you were trying to learn about cache blocking. – Peter Cordes Oct 04 '18 at 19:34
  • Thank you again. The loop does not read the input forward and backward at the same time... One loop reads forward, and the next loop reads backward. So I guess that will be fine. I worry that the memory won't get mapped into the cache when I read or write backward. If I can, I want to map a block of memory into the cache, then write to that mapped memory in reverse, then fetch the next block (going backward), and repeat. I doubt the CPU can predict that I want the next block when I am done with this one, because I deal with the blocks in reverse order. – R zu Oct 04 '18 at 19:55
  • I will ask about backward pre-fetching, stride, and multi-threading in another question. Shouldn't be here. – R zu Oct 04 '18 at 20:02
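
To illustrate the SIMD prefix-sum idea Peter Cordes links in the comments, here is a minimal sketch of the in-register SSE shift-and-add scan from that question, written for a forward scan over 4-float chunks; the loop in this question is the mirrored (suffix-sum) direction, and the chunk-to-chunk carry is still a serial dependency, just one quarter as long per element. This is an editorial illustration, not code from the thread.

#include <immintrin.h>
#include <iostream>

// Inclusive prefix sum of 4 packed floats: [a, b, c, d] -> [a, a+b, a+b+c, a+b+c+d].
// _mm_slli_si128 shifts the whole register up by one / two float lanes,
// feeding zeros in at the bottom, so two shift+add steps complete the scan.
static inline __m128 scan4(__m128 v) {
    v = _mm_add_ps(v, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(v), 4)));
    v = _mm_add_ps(v, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(v), 8)));
    return v;
}

int main() {
    alignas(16) float x[16];
    for (int i = 0; i < 16; ++i) x[i] = 1.0f;

    __m128 carry = _mm_setzero_ps();              // running total of earlier chunks
    for (int i = 0; i < 16; i += 4) {
        __m128 s = _mm_add_ps(scan4(_mm_load_ps(&x[i])), carry);
        _mm_store_ps(&x[i], s);
        carry = _mm_shuffle_ps(s, s, _MM_SHUFFLE(3, 3, 3, 3));  // broadcast last lane
    }

    for (int i = 0; i < 16; ++i) std::cout << x[i] << " ";      // prints 1 2 3 ... 16
    std::cout << "\n";
}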
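
As a concrete sketch of the prefetch-distance point from the comments: the prefetch is issued several cache lines below the line currently being worked on, not just one line below. `PF_DIST` is a made-up placeholder, not a value measured on Broadwell, and per Hadi Brais's measurements above software prefetch may not help this particular loop at any distance, so it would have to be tuned (or dropped) by benchmarking.

#include <stdlib.h>
#include <iostream>

int main() {
    const int N = 25000000;
    float* x = reinterpret_cast<float*>(aligned_alloc(64, 4 * N));  // 0.1 GB
    for (int i = 0; i < N; ++i) x[i] = 0.0f;   // give the summed values a defined start
    x[N - 1] = 1.0f;

    // Hypothetical prefetch distance: 8 cache lines (128 floats) below the line
    // currently being processed, i.e. well ahead in the direction of travel.
    const int PF_DIST = 8 * 16;

    for (int i = N - 16; i >= 0; i -= 16) {
        if (i >= PF_DIST)
            __builtin_prefetch(&x[i - PF_DIST], 0, 3);
        for (int j = 15; j >= 1; --j)
            x[i + j - 1] += x[i + j];
        if (i != 0)
            x[i - 1] = x[i];   // carry the running sum into the next (lower) cache line
    }

    std::cout << x[0] << "\n";
    free(x);
}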
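
For the cache-blocking suggestion, the sketch below is a generic illustration rather than a drop-in rewrite of the suffix sum: two passes that would each stream over the whole 0.1 GB array are instead applied one L2-sized block at a time, so the second pass hits in cache. `BLOCK` (128 KiB of floats) and the per-pass arithmetic are invented for the example.

#include <stdlib.h>
#include <iostream>

int main() {
    const int N = 25000000;
    const int BLOCK = 32768;                       // 32768 floats = 128 KiB, about half an L2
    float* x = reinterpret_cast<float*>(aligned_alloc(64, 4 * N));
    for (int i = 0; i < N; ++i) x[i] = 1.0f;

    for (int b = 0; b < N; b += BLOCK) {
        const int end = (b + BLOCK < N) ? b + BLOCK : N;
        for (int i = b; i < end; ++i)              // pass 1 pulls the block into cache
            x[i] = x[i] * 2.0f + 1.0f;
        for (int i = b; i < end; ++i)              // pass 2 re-reads it while it is still hot
            x[i] = x[i] - 0.5f;
    }

    std::cout << x[0] << "\n";                     // 2.5
    free(x);
}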

0 Answers