6

I use it like this.

__pld(pin[0], pin[1], pin[2], pin[3], pin[4]);

But I get this error.

undefined reference to `__pld'

What am I missing? Do I need to include a header file or something? I am using an ARM Cortex-A8. Does it even support the pld instruction?

MetallicPriest
  • What compiler? Is that supposed to be inline assembler? – Prof. Falken Apr 16 '13 at 08:31
  • That'll depend on your compiler. PLD is an assembler instruction. Example, see http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0472c/Cihbeecj.html – paxdiablo Apr 16 '13 at 08:31
  • The (__builtin_prefetch) answer is good. You should also check that the binary actually contains pld instructions (use objdump). Preload (pld) a few iterations ahead of what you are accessing. The address you are reading should also be aligned to a cache line. It is hard to get pld right. Most of the time the CPU/core prefetches for you as well, so you may not see a gain from this. – auselen Apr 16 '13 at 09:31

4 Answers

7

As shown in this answer, you can use inline assembler as per Clark. __builtin_prefetch is also a good suggestion. An important fact to know is how the pld instruction behaves on ARM: on some processors it does nothing; on others, it brings the data into the cache. It is only going to be effective for a read operation (or read/modify/write). The other thing to note is that, if it does work on your processor, it fetches an entire cache line. So the example of fetching the pin array doesn't need to specify all members.

You will get more performance by ensuring that the data you pld is cache-line aligned. Another issue, judging from the previous code: you will only gain performance with variables you read. In some cases, you are just writing to the pin array, and there is no value in prefetching these items. The ARM has a write buffer, so writes are batched together and will burst to an SDRAM chip automatically.

Grouping all read data together on a cache line will show the most performance improvement; the whole line can be prefetched with a single pld. Also, when you unroll a loop, the compiler will be able to see these reads and will schedule them earlier if possible so that the data is already in the cache when needed; at least for some ARM CPUs.
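As a minimal sketch of the grouping idea, the snippet below packs the read-only values into one aligned structure so that a single prefetch covers all of them. The names `pin_state`, `prefetch_pins`, and `sum_pins` are hypothetical, and the 64-byte line size is an assumption that should be checked against the data sheet of your core.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines. Verify this for your CPU. */
#define CACHE_LINE 64

/* Hypothetical layout: all values that are only read share one
   cache-line-aligned structure. */
struct pin_state {
    uint32_t pin[5];
} __attribute__((aligned(CACHE_LINE)));

/* One prefetch is enough: the hardware fetches the whole line.
   Call this well before the data is needed, not right before the read. */
static inline void prefetch_pins(const struct pin_state *s)
{
    __builtin_prefetch(s, 0, 3);   /* 0 = read, 3 = high temporal locality */
}

/* Reads every member; these loads should hit the cache if the
   prefetch above was issued early enough. */
static inline uint32_t sum_pins(const struct pin_state *s)
{
    uint32_t total = 0;
    for (size_t i = 0; i < 5; i++)
        total += s->pin[i];
    return total;
}
```

The prefetch and the reads are kept in separate functions on purpose, so the hint can be issued some useful amount of work ahead of the actual access.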

Also, you may consider,

 __attribute__((optimize("prefetch-loop-arrays")))

in the spirit of the accepted answer to the other question; probably the compiler will have already enabled this at -O3 if it is effective on the CPU you have specified.
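For illustration, here is roughly how the attribute could be attached to a single function with GCC. The function `add_arrays` is a hypothetical example; whether any pld instructions are actually emitted depends on the target, the -mcpu setting, and the GCC version, so inspect the generated assembly to be sure.

```c
#include <stddef.h>

/* GCC-specific: asks for -fprefetch-loop-arrays on this function only.
   Compilers that do not know the attribute typically just warn. */
__attribute__((optimize("prefetch-loop-arrays")))
void add_arrays(int *a, const int *b, size_t n)
{
    /* A simple streaming loop, the kind of code this option targets. */
    for (size_t i = 0; i < n; i++)
        a[i] += b[i];
}
```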

Various compiler options can be specified with --param NAME=VALUE that allow you to give hints to the compiler on the memory sub-system. This could be a very potent combination, if you get the parameters correct.

  • prefetch-latency
  • simultaneous-prefetches
  • l1-cache-line-size
  • l1-cache-size
  • l2-cache-size
  • min-insn-to-prefetch-ratio
  • prefetch-min-insn-to-mem-ratio

Make sure you specify a -mcpu to the compiler that supports pld. If all is right, the compiler should do this automatically for you. However, sometimes you may need to do it manually.

For reference, here is gcc-4.7.3's ARM prefetch loop arrays code activation.

  /* Enable sw prefetching at -O3 for CPUS that have prefetch, and we have deemed
     it beneficial (signified by setting num_prefetch_slots to 1 or more.)  */
  if (flag_prefetch_loop_arrays < 0
      && HAVE_prefetch
      && optimize >= 3
      && current_tune->num_prefetch_slots > 0)
    flag_prefetch_loop_arrays = 1;
artless noise
  • With gcc-4.7.3, only the **ARM Cortex-A9** will have **num_prefetch_slots** set by default. – artless noise Apr 17 '13 at 00:36
  • Sorry, I missed 'I am using ARM Cortex A8'. This processor supports `pld`; any ARM better than armv5 does. This just means the CPU will decode it. I don't know if the Cortex-A8 does anything with it. The Cortex-A8 data sheet would say. – artless noise Apr 17 '13 at 22:47
  • Some ARM v5+ CPUs accept this instruction, but it is basically a `NOP`. For instance, there are ARM920 CPUs like this. Here, the instruction is actually slightly harmful as it pollutes the instruction stream and does nothing. Code written for another CPU that does use `PLD` will work on these CPUs. However, just because an ARM CPU *supports* `PLD` doesn't mean it is effective at anything. – artless noise May 31 '16 at 12:32
4

Try http://www.ethernut.de/en/documents/arm-inline-asm.html

In GCC, a use of pld might look like this (example from http://communities.mentor.com/community/cs/archives/arm-gnu/msg01553.html):

   __asm__ __volatile__(
    "pld\t[%0]"
    :
    : "r" (first) );
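A sketch of how that statement could be wrapped so the same source still builds on a non-ARM host. The helper name `pld`, the function `sum_ahead`, and the prefetch distance of 16 elements are assumptions for illustration, not tuned values.

```c
#include <stddef.h>

/* Hypothetical helper; not part of any library. */
static inline void pld(const void *addr)
{
#if defined(__arm__)
    /* Same instruction as above: a hint to preload the line holding addr. */
    __asm__ __volatile__("pld\t[%0]" : : "r"(addr));
#else
    (void)addr;   /* no-op when not compiling for 32-bit ARM */
#endif
}

/* Sums an array while hinting at data a fixed distance ahead.
   Assumption: 16 ints ahead (one 64-byte line); tune for your CPU. */
static long sum_ahead(const int *v, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)
            pld(&v[i + 16]);
        total += v[i];
    }
    return total;
}
```

Issuing the hint on every iteration is wasteful, since one pld already covers a whole line; a real loop would usually be unrolled so that there is one hint per cache line.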
Prof. Falken
  • @MetallicPriest I guess that is why it's named first. FYI, prefetch doesn't always optimize things. – 0x90 Apr 16 '13 at 08:45
  • @MetallicPriest, I have no real idea, but it looks like it. %0 would mean the first argument. – Prof. Falken Apr 16 '13 at 08:45
  • 1
    The use of the variable named `first` is because the instruction will prefetch a cache line. If the processor has no cache, pre-load would be meaningless. On modern CPU L1 cache is 0-5 wait states (clock delays), L2 maybe 10-100 and main memory is typically 100-1000. As code can anticipate the need of future memory reads, we need to move memory from 'core' or main memory to either L2 or L1. So any cache line value works, but the first is typically used. It is the concept behind [tag:halide], which is billed as computational photography, but can be used for many other memory bound domains. – artless noise Oct 02 '21 at 13:12
3

You may want to look at gcc's __builtin_prefetch. I have reproduced its documentation here for your convenience:

This function is used to minimize cache-miss latency by moving data into a cache before it is accessed. You can insert calls to __builtin_prefetch into code for which you know addresses of data in memory that is likely to be accessed soon. If the target supports them, data prefetch instructions will be generated. If the prefetch is done early enough before the access then the data will be in the cache by the time it is accessed.

The value of addr is the address of the memory to prefetch. There are two optional arguments, rw and locality. The value of rw is a compile-time constant one or zero; one means that the prefetch is preparing for a write to the memory address and zero, the default, means that the prefetch is preparing for a read. The value locality must be a compile-time constant integer between zero and three. A value of zero means that the data has no temporal locality, so it need not be left in the cache after the access. A value of three means that the data has a high degree of temporal locality and should be left in all levels of cache possible. Values of one and two mean, respectively, a low or moderate degree of temporal locality. The default is three.

     for (i = 0; i < n; i++)
       {
         a[i] = a[i] + b[i];
         __builtin_prefetch (&a[i+j], 1, 1);
         __builtin_prefetch (&b[i+j], 0, 1);
         /* ... */
       }

Data prefetch does not generate faults if addr is invalid, but the address expression itself must be valid. For example, a prefetch of p->next will not fault if p->next is not a valid address, but evaluation will fault if p is not a valid address.

If the target does not support data prefetch, the address expression is evaluated if it includes side effects but no other code is generated and GCC does not issue a warning.
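The p->next case from the documentation can be sketched as a list traversal. The `node` type and `sum_list` function are hypothetical names used only for this example.

```c
#include <stddef.h>

/* Hypothetical singly linked list node. */
struct node {
    int value;
    struct node *next;
};

static long sum_list(const struct node *p)
{
    long total = 0;
    while (p != NULL) {
        /* p is valid here, so evaluating p->next is safe. If p->next is
           NULL, the prefetch itself does not fault. */
        __builtin_prefetch(p->next, 0, 1);
        total += p->value;
        p = p->next;
    }
    return total;
}
```

With so little work per node, the hint has little time to take effect; it tends to matter more when each node involves enough processing to hide the memory latency.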

John Szakmeister
  • Here is [godbolt on ARM](https://godbolt.org/#compilers:!((compiler:armhfg482,options:'-O3',source:'void+prefetch_test(int+*+a,+int+*b,+const+int+n,+const+int+j)%0A%7B%0A++int+i%3B%0A++for+(i+%3D+0%3B+i+%3C+n%3B+i%2B%2B)+%7B%0A+++++a%5Bi%5D+%3D+a%5Bi%5D+%2B+b%5Bi%5D%3B%0A+++++__builtin_prefetch+(%26a%5Bi%2Bj%5D,+1,+1)%3B%0A+++++__builtin_prefetch+(%26b%5Bi%2Bj%5D,+0,+1)%3B%0A++%7D%0A%7D')),filterAsm:(commentOnly:!t,directives:!t,labels:!t),version:3) – artless noise May 31 '16 at 12:47
0
undefined reference to `__pld'

To answer the question about the undefined reference, __pld is an ARM compiler intrinsic. See __pld intrinsic in the ARM manual.

Perhaps GCC does not recognize the ARM intrinsic.
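One possible way to keep a single source tree working with both toolchains is a small dispatch macro. This is only a sketch: the macro name `PREFETCH` and the function `first_plus_last` are hypothetical, and the version test assumes that `__ARMCC_VERSION` values below 6000000 identify the older armcc compiler that provides `__pld` without a header.

```c
#include <stddef.h>

#if defined(__ARMCC_VERSION) && (__ARMCC_VERSION < 6000000)
   /* Assumption: legacy armcc, where __pld is a compiler intrinsic. */
#  define PREFETCH(p) __pld(p)
#elif defined(__GNUC__)
   /* GCC and compatible compilers. */
#  define PREFETCH(p) __builtin_prefetch(p)
#else
   /* Unknown compiler: evaluate the argument and do nothing else. */
#  define PREFETCH(p) ((void)(p))
#endif

/* Tiny usage example: hint at the far end of the array before reading it. */
static int first_plus_last(const int *v, size_t n)
{
    PREFETCH(&v[n - 1]);
    return v[0] + v[n - 1];
}
```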

jww