
I want to prefetch certain addresses (the addresses of certain elements of a huge array) in my C program and see the effect of that on the time taken.

I found the PREFETCH instruction documented here: PREFETCH0. But I am not aware of how to use it in C using inline assembly. It would be of great help if somebody could give some idea of how I should use this instruction, with the address as argument, in a C program.

Peter Cordes
ANTHONY
  • Beware that `__builtin_prefetch`, `_mm_prefetch`, and inline asm all break auto-vectorization with current gcc and clang (https://godbolt.org/g/gfrwzx). **But normally, SW prefetch isn't recommended in loops over arrays** (except on P4), because HW prefetch works well for 1 to a few sequential read streams (but see https://stackoverflow.com/q/47851120/224132). But if you do want it, it seems you'll need to manually vectorize (or use Intel's compiler with a prefetch pragma or something. Sometimes useful for Xeon Phi, which has a different microarchitecture than the mainstream chips.) – Peter Cordes Jan 18 '18 at 12:16

3 Answers


Don't write it using inline assembly, which would make the compiler's job harder. GCC has a built-in extension (see the gcc builtins docs for more details) for prefetch that you should use instead:

__builtin_prefetch(const void*)

This will generate code using the prefetch instructions of your target, but with more scope for the compiler to be smart about it.

As a simple example of the difference between inline asm and gcc's builtin, consider the following two files. test1.c:

void foo(double *d, unsigned len) {
  for (unsigned i = 0; i < len; ++i) {
    __builtin_prefetch(&d[i]);
    d[i] = d[i] * d[i];
  }
}

And test2.c:

void foo(double *d, unsigned len) {
  for (unsigned i = 0; i < len; ++i) {
    asm("prefetcht0 (%0)" 
        : /**/
        : "g"(&d[i])
        : /**/
    );
    d[i] = d[i] * d[i];
  }
}

(Note that if you benchmark this, I'm 99% sure that a third version with no prefetch would be faster than both of the above, because you've got predictable access patterns, so the only thing prefetching really achieves is adding more bytes of instructions and a few more cycles.)

If we compile both with -O3 on x86_64 and diff the generated output we see:

        .file   "test1.c"                                       |          .file   "test2.c"
        .text                                                              .text
        .p2align 4,,15                                                     .p2align 4,,15
        .globl  foo                                                        .globl  foo
        .type   foo, @function                                             .type   foo, @function
foo:                                                               foo:
.LFB0:                                                             .LFB0:
        .cfi_startproc                                                     .cfi_startproc
        testl   %esi, %esi      # len                                      testl   %esi, %esi      # len
        je      .L1     #,                                                 je      .L1     #,
        leal    -1(%rsi), %eax  #, D.1749                       |          leal    -1(%rsi), %eax  #, D.1745
        leaq    8(%rdi,%rax,8), %rax    #, D.1749               |          leaq    8(%rdi,%rax,8), %rax    #, D.1745
        .p2align 4,,10                                                     .p2align 4,,10
        .p2align 3                                                         .p2align 3
.L4:                                                               .L4:
        movsd   (%rdi), %xmm0   # MEM[base: _8, offset: 0B], D. |  #APP
        prefetcht0      (%rdi)  # ivtmp.6                       |  # 3 "test2.c" 1
                                                                >          prefetcht0 (%rdi)       # ivtmp.6
                                                                >  # 0 "" 2
                                                                >  #NO_APP
                                                                >          movsd   (%rdi), %xmm0   # MEM[base: _8, offset: 0B], D.
        addq    $8, %rdi        #, ivtmp.6                                 addq    $8, %rdi        #, ivtmp.6
        mulsd   %xmm0, %xmm0    # D.1748, D.1748                |          mulsd   %xmm0, %xmm0    # D.1747, D.1747
        movsd   %xmm0, -8(%rdi) # D.1748, MEM[base: _8, offset: |          movsd   %xmm0, -8(%rdi) # D.1747, MEM[base: _8, offset:
        cmpq    %rax, %rdi      # D.1749, ivtmp.6               |          cmpq    %rax, %rdi      # D.1745, ivtmp.6
        jne     .L4     #,                                                 jne     .L4     #,
.L1:                                                               .L1:
        rep ret                                                            rep ret
        .cfi_endproc                                                       .cfi_endproc
.LFE0:                                                             .LFE0:
        .size   foo, .-foo                                                 .size   foo, .-foo
        .ident  "GCC: (Ubuntu 4.8.4-2ubuntu1~14.04.3) 4.8.4"               .ident  "GCC: (Ubuntu 4.8.4-2ubuntu1~14.04.3) 4.8.4"
        .section        .note.GNU-stack,"",@progbits                       .section        .note.GNU-stack,"",@progbits

Even in this simple case the compiler in question (GCC 4.8.4) has taken advantage of the fact that it's allowed to reorder things, and has chosen, presumably on the basis of an internal model of the target processors, to move the prefetch after the initial load has happened. If I had to guess, it's slightly faster to do the load and prefetch in that order in some scenarios; presumably the penalty for a miss and a hit is lower with this ordering, or the ordering works better with branch prediction. It doesn't really matter why the compiler chose to do this, though: the point is that it's exceedingly complex to fully understand the impact of even trivial changes to generated code on modern processors in real applications.

By using builtin functions instead of inline assembly you benefit from the compiler's knowledge today, and from any improvements that show up in the future. Even if you spend two weeks studying and benchmarking this simple case, the odds are fairly good that you won't beat future compilers, and you may even end up with a code base that can't benefit from future improvements.

Those problems arise before we even begin to discuss the portability of your code. With builtin functions, on an architecture without support you normally get one of two behaviours: graceful degradation or emulation. By contrast, applications with lots of x86 inline assembly were harder to port to x86_64 when that architecture came along.

Flexo
  • Thanks. The above code in turn generates PREFETCHT0. How can I (make the compiler) generate the other prefetch instructions as well? My intention is not just prefetching but to see the effect of its various types. How can I specify the four instructions/hints PREFETCHT0, PREFETCHT1, PREFETCHT2, PREFETCHNTA in `__builtin_prefetch()`? – ANTHONY Nov 05 '16 at 07:45
  • But the compiler has more precise knowledge about `__builtin_prefetch` (and its *few* side effects) than when you use some `asm` instruction. – Basile Starynkevitch Nov 05 '16 at 07:46
  • I got it. Third argument does that part. – ANTHONY Nov 05 '16 at 07:51
  • @BasileStarynkevitch Can you provide more details about your statement: why does the compiler find asm harder to handle than the builtin? – ANTHONY Nov 05 '16 at 07:53
  • @ANTHONY with inline asm the compiler just sees some inputs, some outputs, a block of stuff it can't understand at a high level or reason about, and a few constraints. With a built-in function it ends up integrated into the program like any other statement or expression and can be intimately involved in any pass that reorders, picks registers or stack space etc. The compiler knows exactly what your intention is with the built-in, but inline asm is little more than a string. I'll try and pull together a concrete example later. – Flexo Nov 05 '16 at 08:21
  • @Flexo this is not an answer that should ever be on the "assembly" tag; it is "simply wrong". Please limit such answers to the "c++" tag or whatever else you prefer. *Assume* that you have a time-*critical* task and that it is *theoretically* possible that the compiler is not optimal. Inline assembly is there for a reason, and it's a bad answer to a good question. You cannot use a builtin function in an asm block. – Abdul Ahad Jan 17 '18 at 11:10
  • It makes no sense to prefetch the same element you're about to demand-load. `__builtin_prefetch(&d[i+128]);` might make sense (prefetch distance of `128 * sizeof(*d)` bytes), but really you'd want to unroll your loop some so you only prefetch about once per 64 bytes (cache line size). Especially on IvyBridge, which has a throughput performance bug for prefetch instructions, so this code could kill performance on IvB. – Peter Cordes Jan 17 '18 at 18:18
  • @PeterCordes I think your first sentence / thesis is wrong. I'm not sure, but I think prefetch is used to asynchronously move data from RAM or L3 to L1; you can then perform some other operations, and by the time you load the data it's in L1, which is closer to the CPU. I could be wrong. I've read that you have to keep busy for 50 cycles or so performing other operations, but the number could be off – Abdul Ahad Jan 17 '18 at 19:22
  • @AbdulAhad: I meant in a loop where you're processing every array element, which is what Flexo's code is doing. In the more general case, yes you might prefetch something you're going to need, then do other stuff before loading it. I'm not sure exactly how out-of-order execution handles things, but a regular load into a register you don't read until several instructions later might be just as good. I'm not sure if a load uop can retire from the ROB before the data actually arrives, with only a load buffer still tracking it. In-order pipelined CPUs typically work that way, though. – Peter Cordes Jan 17 '18 at 19:35
  • @Flexo: you should really unroll so you only prefetch once per 64 bytes, not once per `double`. Current GNU C compilers (gcc/clang/icc) don't optimize away extra prefetches to the same cache line. If you manually unroll so you only prefetch once per line, clang will auto-vectorize (but gcc won't, at least not the way I wrote it): https://godbolt.org/g/RCibxm. ICC doesn't do well, either. :/ But of course you usually shouldn't SW prefetch on sequential arrays at all for most modern x86 CPUs. – Peter Cordes Jan 19 '18 at 20:59
  • @PeterCordes - that was sort of the point I was trying to make from the diff: it's easy to make a silly choice where the compiler knows far better than you do – Flexo Jan 19 '18 at 21:25
  • I don't think your diff shows that, or good decision-making by the compiler. Putting the prefetch after the demand-load is 100% pointless. If gcc really understood prefetching and memory bandwidth, it would remove that prefetch and let the demand-load trigger the access to that cache line, instead of wasting load uops on prefetch instructions. Also, `__builtin_prefetch` breaks auto-vectorization, because gcc apparently doesn't how to combine it when unrolling, even with a `-march=` which tells it the cache-line size. (So a version with no prefetch would be faster because of auto-vec...) – Peter Cordes Jan 19 '18 at 23:52
  • And BTW, your inline-asm version is a bit silly. Instead of asking for the address in a register and using a `(%0)` addressing mode manually, let the compiler choose the addressing mode. (Actually you used a `"g"` constraint, which would allow a pointer stored in memory -> asm syntax error.) Anyway, `asm volatile("prefetcht0 %0" : : "m"(d[i+64]));` looks reasonable, and lets the compiler choose an addressing mode to reference `d[i+64]`. (It's implicitly volatile with no outputs, but might as well make that explicit. So that obviously blocks unrolling) – Peter Cordes Jan 19 '18 at 23:57

You could add a PREFETCH* assembler instruction in some asm code; see How to Use Assembly Language in C code.

However, you should prefer (as Flexo answered) using `__builtin_prefetch`, because it is a compiler-internal builtin (and it accepts two optional additional arguments after the address to prefetch), and the compiler knows more about it than about whatever you put in your asm instruction. So it will probably optimize the rest of your code more wisely.

See also this and that answer. Adding too many (or wrongly placed) prefetch instructions can slow down your program, so you should use them sparingly (and perhaps not at all). Be sure to benchmark (and enable optimizations, e.g. gcc -O2 -march=native ...). Heuristically, you want to prefetch data "in advance" (e.g. for the next 5 or 10 iterations).

Basile Starynkevitch

Assume that you have a time-critical task (including usage of XOR) and that the compiler is not always optimal.

I'll update this answer with the time measurements of a more complex problem when it's finished. This is the only answer that addresses the question; all of the other answers essentially say "don't do it". See the quote from Agner Fog below.

//  CLOCK_MONOTONIC_COARSE
//  CLOCK_MONOTONIC_RAW

// Prefetch *ptr, do ~10^6 XORs while the line arrives, then load it.
// The pointer must be 16-byte aligned for movdqa.  "1:"/"1b" is a
// local label so the macro can be expanded more than once.
#define DO_SOMETHING_ELSE_BEFORE_LOADING(ptr)   \
asm volatile (                                  \
        "movl        $1000000, %%ecx        ; " \
        "prefetcht0  (%0)                   ; " \
        "1:                                 ; " \
        "pxor        %%xmm0,  %%xmm1        ; " \
        "dec         %%ecx                  ; " \
        "jnz         1b                     ; " \
        "movdqa      (%0),  %%xmm0          ; " \
        :                                       \
        : "r"(ptr)                              \
        : "ecx", "xmm0", "xmm1", "memory"       \
);

int main() {
    static _Alignas(16) double buf[2] = {1.0, 2.0};
    DO_SOMETHING_ELSE_BEFORE_LOADING(buf)
    return 0;
}

The following (Agner Fog's assembly optimization guide) looks like a good resource and essentially asserts the exact opposite of the other answers.

It states:

    1. Optimizing code for speed. Modern C++ compilers generally optimize code quite well in most cases. But there are still cases where compilers perform poorly and where dramatic increases in speed can be achieved by careful assembly programming.

CMakeLists.txt fragment:

SET(CMAKE_CXX_FLAGS "-std=gnu++11 -march=native -mtune=native -msse2")
Abdul Ahad
  • You've not even managed to get the basic syntax of an asm block correct, which makes me somewhat less inclined to accept the premise that you can do better than a modern compiler. Your answer also isn't legal C, on a question that was asking about C. – Flexo Jan 17 '18 at 17:27
  • `decw` is word operand size, but `%ecx` is a dword register. More importantly, your inline asm doesn't have an input constraint for the pointer you're prefetching! So you prefetch from whatever garbage address was in `%rax` a lot of times... Also, your inline asm is using hard-coded registers, instead of leaving register allocation to the compiler. – Peter Cordes Jan 17 '18 at 18:21
  • Agner Fog's Optimizing Assembly guide doesn't claim that you can't get good asm from a C compiler. It's not unusual to write performance-critical code in C, and use prefetch intrinsics / builtins. If you're writing a whole loop in asm, you already have your pointer values in registers so there's no C question, you'd just write a `prefetcht0` instruction. Using inline asm for prefetch, and pure C for the rest of the code is unnecessary (although it seems the OP was thinking it was). But it really shouldn't hurt, if you use an `"m"` constraint to let the compiler choose addressing mode. – Peter Cordes Jan 17 '18 at 18:27
  • @PeterCordes - yeah, it's just an example of **prefetcht0** usage and then **doing something else** while the bits are arriving in L1 or whatever. the code does nothing, it just answers the question.. I'll update it with measurements after I am done. it will take a couple of hours as stated in the answer. I got pulled to something else – Abdul Ahad Jan 17 '18 at 18:27
  • so, regarding objections that advise against using inline assembly in favor of always using builtin intrinsics, theoretically, while(1) xor(); – Abdul Ahad Jan 17 '18 at 20:33
  • @Flexo, I updated the question https://stackoverflow.com/questions/47927158/is-it-possible-to-call-a-built-in-function-from-assembly-in-c if you think it's not possible to improve it and there is a bounty. thanks Peter and Flexo, exclusive of the ad hominem bunk – Abdul Ahad Jan 18 '18 at 01:59
  • @PeterCordes here's an example. I added a bounty https://stackoverflow.com/questions/47927158/is-it-possible-to-call-a-built-in-function-from-assembly-in-c – Abdul Ahad Jan 18 '18 at 02:04
  • Yes you can often beat the compiler *if* you know exactly what you're doing. But if you have a loop in C, and it would benefit from software prefetching (e.g. a binary search where you prefetch both possibilities for the comparison *after* this one, the 1/4 and 3/4 positions), you don't have to rewrite it in asm to add prefetching. If you're not going to do that, you shouldn't use inline asm for the prefetch either. (And if you do rewrite the whole thing in asm, inline or otherwise, then you already have the pointer in a register so input constraints aren't relevant). – Peter Cordes Jan 18 '18 at 04:22
  • yeah, the other question uses vaesenc of the bits to generate the next index @PeterCordes so essentially it's a random location every time. the original code I've already fixed uses nothing but intrinsics, exclusive of MULQ, but I'm trying to improve it. I think it's also possible to beat the compiler if you *don't* know what you are doing. the compiler simply gives up after 500 milliseconds or whatever, possibly based on bad advice.. this is basically a wrapper for XOR and the only answer that addresses the question – Abdul Ahad Jan 18 '18 at 07:32
  • .. assume it's a good question, and the only reason I am using this site is because of the *miracle of the Google Corporation* which is worth approximately a *baziiliion* dollars – Abdul Ahad Jan 18 '18 at 07:36
  • and the only thing I care about when reading this question via the *Google meta information* is **how to use prefetcht0**, upvoting my own answer. This was addressed by the whole, "I think I'm right", "you think you're right" or "we're ready", and people should use their time researching XOR instead of downvoting the only answer which doesn't dodge the question or engage in sophistry in their c++ – Abdul Ahad Jan 18 '18 at 07:43
  • Inside inline asm, you write `prefetcht0 (%0)` (or `%[my_operand]` or whatever). Any details about constraints or using inline-asm in general isn't useful or interesting, because you *only* do it this way (instead of using `__builtin_prefetch`) if you already have an inline-asm block. If you're just using C without inline asm, you should use the `__builtin` function, or Intel's `_mm_prefetch` intrinsic. i.e. this question isn't a useful place to put a tutorial on using inline asm; there are already several good ones (see https://stackoverflow.com/tags/inline-assembly/info). – Peter Cordes Jan 18 '18 at 08:22
  • All that matters is that the object file contains a good sequence of instructions (if we're not containing maintainability or reusability, or ability to compile for future instruction sets / optimize for different CPUs) i.e. your use-case where you want to tune the crap out of something for your current CPU. It doesn't matter whether the compiler generated them, or whether the compiler just filled in operands in an asm template. If you can get the compiler to emit the instructions you want using builtin functions or even just portable C, there's no advantage to using inline asm. – Peter Cordes Jan 18 '18 at 08:25
  • it's interesting to me and anyone else searching google for "how to use inline-asm", see the other answer, I actually need to use inline assembly in the other question, and this is the **assembly** tag.. the question asks, "I am not aware of how to use it in C using inline assembly. It would be of great help if"... tyvm for the link.. this is a **google** inline-asm result – Abdul Ahad Jan 18 '18 at 08:28
  • Anyway, your claims that the other answers are wrong are totally bogus. Read the question carefully: the OP has some C code (which doesn't already use inline-asm, as far as I can tell), and wants to add prefetching. This is the normal case for most people, which is why I upvoted the other answers. Yours would be ok if you didn't go on a big rant about the others being wrong and inline asm being the *only* way to go. – Peter Cordes Jan 18 '18 at 08:28
  • @PeterCordes I basically stop reading the other answers after the assertion/thesis "don't use inline-asm, it is always wrong, use a builtin function instead because the compiler is always better because it is so awesome and you don't know exactly what you are doing" ... How to use PREFETCHT0 Instruction in my C code? – Abdul Ahad Jan 18 '18 at 08:31
  • Often people asking questions have some idea of what they think the answer will be, and phrase the question that way. If they go too far down that road, it leads to [the X-Y problem](https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem). You can tell from the fact **the OP accepted the `__builtin_prefetch` answer** (and not Basile's answer with a link to using inline asm) that it *was* what they were looking for and did answer their real question. If you got here searching for something else, that's unfortunate that it wasted your time. But this question is about C. – Peter Cordes Jan 18 '18 at 08:33
  • I just don't think XOR is a problem and this is the only answer to the actual question.. it's useful for a google search for anyone that actually wants an answer to the problem.. is this site used to help the OP only? why not delete the question immediately after you have served the OP? the tags are: c linux assembly x86 inline-assembly – Abdul Ahad Jan 18 '18 at 08:35
  • If you'd bothered to keep reading, you would have seen that the top answer shows that `__builtin_prefetch` *does* get gcc to emit `prefetcht0` when compiling for x86, by showing the compiler's asm output. This is what you want, whether you do it with inline asm or with a builtin. There are times when hand-written asm is worth it, but most assembly / performance experts *do* recommend avoiding inline asm whenever you can get equivalent results without it. Writing a whole stand-alone function in asm is often good, BTW. That's what glibc does for `strlen`, `memcpy`, and so on, not inline asm. – Peter Cordes Jan 18 '18 at 08:37
  • yeah, i know that, but again, I have to parse through 500 lines of sophistry to find the prefetcht0 instruction.. in my answer: asm volatile ( \ "movl $1000000, %%ecx ; " \ "prefetcht0 (%%rax) ; " \ "movdqa (%%rax), %%xmm0 ; " \ ... I really don't have time for english, I'm just reading the code anyway – Abdul Ahad Jan 18 '18 at 08:38
  • Huh, I hadn't noticed that `__builtin_prefetch` breaks auto-vectorization of this loop! So does `_mm_prefetch(d+i+64, _MM_HINT_T0);` (because presumably the immintrin.h implementation uses `__builtin_prefetch`). But so does inline asm inside a loop that's otherwise C. https://godbolt.org/g/gfrwzx. So it turns out gcc and clang do suck at prefetch, but you could still manually vectorize with intrinsics and avoid inline asm if you want. Of course, SW prefetch for sequential array access is usually not helpful anyway; HW prefetch detects that pattern. – Peter Cordes Jan 18 '18 at 12:10