
I am trying to vectorize a loop that computes the dot product of two large float vectors. I am computing it in parallel, utilizing the fact that the CPU has a large number of XMM registers, like this:

__m128 *A, *B;
__m128 dot0, dot1, dot2, dot3;
dot0 = dot1 = dot2 = dot3 = _mm_set_ps1(0);
for(size_t i=0; i<1048576; i+=4) {
    dot0 = _mm_add_ps( dot0, _mm_mul_ps( A[i+0], B[i+0] ) );
    dot1 = _mm_add_ps( dot1, _mm_mul_ps( A[i+1], B[i+1] ) );
    dot2 = _mm_add_ps( dot2, _mm_mul_ps( A[i+2], B[i+2] ) );
    dot3 = _mm_add_ps( dot3, _mm_mul_ps( A[i+3], B[i+3] ) );
}
... // add dots, then shuffle/hadd result.

I heard that using prefetch instructions could help speed things up, as they could fetch further data "in the background" while the muls and adds run on data that is already in cache. However, I failed to find examples and explanations of how to use _mm_prefetch(): when, with what addresses, and with what hints. Could you assist with this?

Peter Cordes
xakepp35
  • Usually hardware prefetch does a good job for sequential access, and you don't need software prefetch. See also [What Every Programmer Should Know About Memory?](https://stackoverflow.com/questions/8126311/what-every-programmer-should-know-about-memory/47714514#47714514). – Peter Cordes Feb 26 '18 at 18:08
  • You might consider using 8 accumulators, though, instead of only 4, to better hide the latency of `addps` (especially on Skylake where it's 4 cycle latency). x86-64 has 16 xmm registers (or ymm with AVX). And *especially* if you're compiling with FMA so those mul/add operations can collapse into a single FMA. More details [on another SSE/AVX dot-product question](https://stackoverflow.com/questions/45113527/why-does-mulss-take-only-3-cycles-on-haswell-different-from-agners-instruction). This might not help if you're mostly memory bottlenecked, though. – Peter Cordes Feb 26 '18 at 18:13
  • @Peter Cordes You are the best person of the week! What I read from you taught me the right way of doing things. By the way, if I insist on using the software prefetcher - where and which instructions should I put in the described example? And why? Just for the sake of studying it. – xakepp35 Feb 26 '18 at 18:56
  • You probably want to use `prefetchnta` if A and B are large and won't be read again soon. You want to prefetch once per 64B cache line, and you'll need to tune how far ahead to prefetch. e.g. `_mm_prefetch((char*)(A+64), _MM_HINT_NTA);` and the same for B would prefetch 16*64 = 1024 bytes ahead of where you're loading, allowing for hiding some of the latency of a cache miss but still easily fitting in L1D. Read Ulrich Drepper's "What Every Programmer Should Know About Memory?" (the PDF in my earlier link) for more about caching and prefetch. – Peter Cordes Feb 26 '18 at 19:02
  • See also: https://stackoverflow.com/questions/3122915/the-prefetch-instruction – Paul R Feb 26 '18 at 19:14
  • @Peter Cordes Ok, non-temporal prefetch, for data used once. But why would it prefetch 64 bytes for A, but 1024 bytes for B? And how far should I prefetch? I assume that one instruction should be at the beginning of the loop, and another in the middle, because 2 nearby prefetch instructions work badly? – xakepp35 Feb 26 '18 at 19:16
  • @xakepp35: I forgot you were indexing into `A` with `A[i]`, rather than incrementing a pointer with `A+=4`, so my code left out the `i`. You put the prefetch in the loop. See BeeOnRope's answer. – Peter Cordes Feb 26 '18 at 19:27
  • @PeterCordes My experiments on Skylake X have convinced me not to recommend `prefetchnta` anymore unless you really *really* know what you're doing. And I emphasize the "really" part because even I cannot consider myself competent enough to use it properly without it backfiring on Skylake X. – Mysticial Feb 26 '18 at 19:36
  • If you `prefetchnta` and it gets evicted before it is used (either because it's too early or because the other hyperthread is misbehaving), it gets evicted from all levels of cache and then refetched from memory on the actual read. The result is 2x the bandwidth cost and a massive slowdown on anything that's bandwidth-bound. At least `prefetcht0` seems to get evicted into the higher level caches. – Mysticial Feb 26 '18 at 19:36
  • @Mysticial: ah, interesting, so it's probably able to bypass L3 as well as L2, because L3 isn't inclusive anymore. That's good for the best case, but yeah it would make it even more brittle. That's always been a problem for SW prefetch: the right prefetch distance depends on the HW details and load conditions. – Peter Cordes Feb 26 '18 at 19:40
  • @PeterCordes It might be related to the new cache design. Or it could be that they changed the implementation of `prefetchnta` to *really* only pull into L1. (Since I've mentioned in the past that I saw no difference between `prefetcht0` and `prefetchnta` on older processors.) – Mysticial Feb 26 '18 at 19:52
  • Either way, in my use cases, I often prefetch around 100 - 1000 instructions ahead of time. (this is not very configurable since they are often locked to the cache blocking sizes) The problem that makes it almost impossible to `prefetchnta` is that when the code becomes bandwidth-bound, even 100 instructions can stretch out to thousands of cycles. This gives a very large window for the other hyperthread to destroy the L1. – Mysticial Feb 26 '18 at 19:52
  • @Mysticial: Right, but with inclusive L3 on previous uarches, a design that bypassed L3 wasn't even possible. – Peter Cordes Feb 26 '18 at 19:53
  • Yeah, exactly: it's almost certainly a consequence of the L3 being a victim cache in SKL: if the line never gets into L2 (as was true even before SKLX, AFAICT), it's never going to get into L3 in a victim-cache design, I guess. So yeah you can say NTA is working even more "as designed" (truly pulled only into the L1) but yup that's super brittle especially if your algorithm has any chance of even occasional conflict misses in L1. – BeeOnRope Feb 26 '18 at 20:04
  • @Mysticial - I added your observation to a new SKLX section [on this answer](https://stackoverflow.com/a/47959255/149138) since I think it's an important change. If you have any other details, feel free to share! I haven't had the chance to test anything on SKLX, so I'm using observations like this to fill in my knowledge in the meantime. – BeeOnRope Feb 26 '18 at 20:35
  • @BeeOnRope For fanatics like us, if you can afford it, I'd highly recommend just getting one of these Skylake X boxes. It's a completely different world from even desktop Skylake. Intel has confirmed that even the 6/8 core models have both FMAs. So you can get a low-end (yet fully functional) SKX system for under $1000 USD. We're not even a year in and I already have two of these systems at home - a 10-core from launch day and a 14-core engineering sample that was recently given to me. And even between these two systems, there are large differences due to the different die sizes. – Mysticial Feb 26 '18 at 20:55
  • I'm traveling so unfortunately getting a desktop is not an option for me. I actually waited for Skylake laptops because at one point it seemed they'd have AVX512 but of course that didn't pan out. Still, just using a cloud box is one reasonable option (but with a lot of limitations like not being able to access the full PMU). – BeeOnRope Feb 26 '18 at 20:58
  • @BeeOnRope Ah that's unfortunate... I'd say you'd also need more than just the PMU. You'll also need the overclocking options to properly control the various clocks for certain types of benchmarks/measurements. – Mysticial Feb 26 '18 at 21:16
  • @Mysticial - yes, that's ideal, but I have found that I can still get results which are at least interesting in a stochastic sense with lots of iterations, some application of stats, etc. Some types of investigations are just more-or-less impossible on cloud hardware though. – BeeOnRope Feb 26 '18 at 21:17

1 Answer


The short answer for a perfectly linear streaming loop like yours is probably: don't use software prefetch at all, let the hardware prefetchers do the work.

Still, it's possible that you can speed things up with software prefetching, and here is the theory and some detail if you want to try...

Basically you call _mm_prefetch() on an address you'll need at some point in the future. It is similar in some respects to loading a value from memory and doing nothing with it: both bring the line into the L1 cache2, but the prefetch intrinsic, which under the covers is emitting specific prefetch instructions, has some advantages which make it suitable for prefetching.
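
For reference, a minimal usage sketch (my own illustration, not code from the question; the helper name prefetch_line is made up): the intrinsic is declared in xmmintrin.h, takes a byte address anywhere inside the line you want fetched, and a hint that must be a compile-time constant.

#include <xmmintrin.h>   /* _mm_prefetch() and the _MM_HINT_* constants */

static inline void prefetch_line(const void* p) {
    /* Any byte address inside the target cache line works; the hint is one of
       _MM_HINT_T0, _MM_HINT_T1, _MM_HINT_T2 or _MM_HINT_NTA. */
    _mm_prefetch((const char*)p, _MM_HINT_T0);
}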

It works at cache-line granularity1: you only need to issue one prefetch for each cache line: more is just a waste. That means that in general, you should try to unroll your loop enough so that you can issue only one prefetch per cache line. In the case of 16-byte __m128 values, that means unroll at least by 4 (which you've done, so you are good there).

Then simply prefetch each of your access streams some PF_DIST distance ahead of the current calculation, something like:

for(size_t i=0; i<1048576; i+=4) {
    dot0 = _mm_add_ps( dot0, _mm_mul_ps( A[i+0], B[i+0] ) );
    dot1 = _mm_add_ps( dot1, _mm_mul_ps( A[i+1], B[i+1] ) );
    dot2 = _mm_add_ps( dot2, _mm_mul_ps( A[i+2], B[i+2] ) );
    dot3 = _mm_add_ps( dot3, _mm_mul_ps( A[i+3], B[i+3] ) );
    _mm_prefetch((const char*)(A + i + PF_A_DIST), HINT_A);
    _mm_prefetch((const char*)(B + i + PF_B_DIST), HINT_B);
}

Here PF_[A|B]_DIST is the distance to prefetch ahead of the current iteration and HINT_ is the temporal hint to use. Rather than try to calculate the right distance value from first principles, I would simply determine good values of PF_[A|B]_DIST experimentally4. To reduce the search space, you can start by setting them both equal, since logically a similar distance is likely to be ideal. You might find that only prefetching one of the two streams is ideal.
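
If it helps, here is one possible shape for such an experiment - a hypothetical, self-contained harness of my own (assuming GCC/Clang on a POSIX system), not code from the question. Build it once per candidate distance, e.g. cc -O3 -DPF_A_DIST=64 -DPF_B_DIST=64 tune.c, and compare the reported times; the distances are in units of __m128, as in the loop above, and the hint is fixed to _MM_HINT_T0 just to keep the sketch short.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <xmmintrin.h>

#ifndef PF_A_DIST
#define PF_A_DIST 64          /* placeholder: 64 * 16 bytes = 1024 bytes ahead */
#endif
#ifndef PF_B_DIST
#define PF_B_DIST 64
#endif
#define N 1048576             /* __m128 elements per array, as in the question */

static float dot(const __m128* A, const __m128* B) {
    __m128 dot0 = _mm_setzero_ps(), dot1 = _mm_setzero_ps();
    __m128 dot2 = _mm_setzero_ps(), dot3 = _mm_setzero_ps();
    for (size_t i = 0; i < N; i += 4) {
        dot0 = _mm_add_ps(dot0, _mm_mul_ps(A[i+0], B[i+0]));
        dot1 = _mm_add_ps(dot1, _mm_mul_ps(A[i+1], B[i+1]));
        dot2 = _mm_add_ps(dot2, _mm_mul_ps(A[i+2], B[i+2]));
        dot3 = _mm_add_ps(dot3, _mm_mul_ps(A[i+3], B[i+3]));
        /* These prefetches run harmlessly past the end of the arrays near the
           finish; prefetch instructions do not fault. */
        _mm_prefetch((const char*)(A + i + PF_A_DIST), _MM_HINT_T0);
        _mm_prefetch((const char*)(B + i + PF_B_DIST), _MM_HINT_T0);
    }
    __m128 s = _mm_add_ps(_mm_add_ps(dot0, dot1), _mm_add_ps(dot2, dot3));
    float lanes[4];
    _mm_storeu_ps(lanes, s);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}

int main(void) {
    __m128* A = _mm_malloc(N * sizeof(__m128), 64);
    __m128* B = _mm_malloc(N * sizeof(__m128), 64);
    for (size_t i = 0; i < N; i++) A[i] = B[i] = _mm_set_ps1(1.0f);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    float result = 0;
    for (int rep = 0; rep < 10; rep++) result += dot(A, B);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ms = (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
    printf("PF_A_DIST=%d PF_B_DIST=%d: %.1f ms (result %g)\n",
           PF_A_DIST, PF_B_DIST, ms, result);
    _mm_free(A);
    _mm_free(B);
    return 0;
}

A small script that rebuilds this for a range of candidate distances and graphs the times will usually show an obvious plateau of good values to choose from.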

It is very important to note that the ideal PF_DIST depends on the hardware configuration. Not just on the CPU model, but also on the memory configuration, including details such as the snooping mode for multi-socket systems. For example, the best value could be wildly different on client and server chips of the same CPU family. So you should run your tuning experiment on the actual hardware you are targeting, as much as possible. If you target a variety of hardware, you can test on all of it and hopefully find a value that's good everywhere, or even consider compile-time or runtime dispatching depending on CPU type (not always enough, as above) or based on a runtime test. Now just relying on hardware prefetching is starting to sound a lot better, isn't it?
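
If you do go the dispatch route, one low-tech option is GCC/Clang's CPU-detection builtins - a sketch of mine with purely made-up distances, not recommendations. Note that the prefetch *address* can use a runtime distance; it is only the hint argument that must be a compile-time constant.

#include <stddef.h>

/* Placeholder distances, in units of __m128; fill them in from your own
   tuning runs on each target. */
static size_t pick_pf_dist(void) {
    __builtin_cpu_init();                          /* initializes the CPU detection data */
    if (__builtin_cpu_is("skylake")) return 64;    /* placeholder value */
    if (__builtin_cpu_is("haswell")) return 48;    /* placeholder value */
    return 32;                                     /* conservative default for unknown CPUs */
}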

You can use the same approach to find the best HINT, since the search space is small (only 4 values to try) - but here you should be aware that the difference between the different hints (particularly _MM_HINT_NTA) might only show up as a performance difference in code that runs after this loop, since they affect how much data unrelated to this kernel remains in the cache.

You might also find that this prefetching doesn't help at all, since your access patterns are perfectly linear and likely to be handled well by the L2 stream prefetchers. Still, there are some additional, more hardcore things you could try or consider:

  • You might investigate whether prefetching only at the start of 4K page boundaries helps3. This will complicate your loop structure: you'll probably need a nested loop to separate the "near edge of page" and "deep inside the page" cases in order to issue prefetches only near page boundaries. You'll also want to make your input arrays page-aligned, or else it gets even more complicated.
  • You can try disabling some/all of the hardware prefetchers. This is usually terrible for overall performance, but on a highly tuned load with software prefetching, you might see better performance by eliminating interference from hardware prefetching. Selectively disabling prefetching also gives you a key tool to help understand what's going on, even if you ultimately leave all the prefetchers enabled.
  • Make sure you are using huge pages, since for large contiguous blocks like this they are ideal.
  • There are problems with prefetching at the beginning and end of your main calculation loop: at the start you'll miss prefetching the data at the start of each array (within the initial PF_DIST window), and at the end of the loop you'll issue prefetches for addresses up to PF_DIST beyond the end of your array. At best these waste fetch and instruction bandwidth, but they may also cause (ultimately discarded) page faults which may affect performance. You can fix both with special intro and outro loops to handle these cases; a rough sketch follows this list.
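
Here is roughly what the end-of-array half of that might look like - my own sketch of the idea, not the answer's exact code; it assumes n and pf_dist are multiples of 4 and pins the hint to _MM_HINT_T0 for brevity. The main loop computes and prefetches, and a short outro loop covers the last pf_dist elements without prefetching, so no prefetch address reaches past the arrays; an intro loop (or a burst of upfront prefetches) covering the first pf_dist window would be symmetric.

#include <stddef.h>
#include <xmmintrin.h>

static __m128 dot4_partials(const __m128* A, const __m128* B, size_t n, size_t pf_dist) {
    __m128 dot0 = _mm_setzero_ps(), dot1 = _mm_setzero_ps();
    __m128 dot2 = _mm_setzero_ps(), dot3 = _mm_setzero_ps();
    size_t i = 0;
    size_t main_end = (n > pf_dist) ? n - pf_dist : 0;    /* stop prefetching here */
    for (; i < main_end; i += 4) {                        /* main loop: compute + prefetch */
        dot0 = _mm_add_ps(dot0, _mm_mul_ps(A[i+0], B[i+0]));
        dot1 = _mm_add_ps(dot1, _mm_mul_ps(A[i+1], B[i+1]));
        dot2 = _mm_add_ps(dot2, _mm_mul_ps(A[i+2], B[i+2]));
        dot3 = _mm_add_ps(dot3, _mm_mul_ps(A[i+3], B[i+3]));
        _mm_prefetch((const char*)(A + i + pf_dist), _MM_HINT_T0);
        _mm_prefetch((const char*)(B + i + pf_dist), _MM_HINT_T0);
    }
    for (; i < n; i += 4) {                               /* outro: same math, no prefetch */
        dot0 = _mm_add_ps(dot0, _mm_mul_ps(A[i+0], B[i+0]));
        dot1 = _mm_add_ps(dot1, _mm_mul_ps(A[i+1], B[i+1]));
        dot2 = _mm_add_ps(dot2, _mm_mul_ps(A[i+2], B[i+2]));
        dot3 = _mm_add_ps(dot3, _mm_mul_ps(A[i+3], B[i+3]));
    }
    /* The horizontal reduction of the four partial sums still happens afterwards. */
    return _mm_add_ps(_mm_add_ps(dot0, dot1), _mm_add_ps(dot2, dot3));
}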

I also highly recommend the 5-part blog post Optimizing AMD Opteron Memory Bandwidth, which describes optimizing a problem very similar to yours, and which covers prefetching in some detail (it gave a large boost). Now this is totally different hardware (AMD Opteron) which likely behaves differently to more recent hardware (and especially to Intel hardware if that's what you're using) - but the process of improvement is key and the author is an expert in the field.


1 It may actually work at something like 2-cache-line granularity depending on how it interacts with the adjacent cache line prefetcher(s). In this case, you may be able to get away with issuing half the number of prefetches: one every 128 bytes.
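
For instance - a drop-in fragment I'm adding for illustration, meant to replace the two prefetches in the loop above - since each unrolled iteration advances by 4 * 16 = 64 bytes, this guard issues the prefetches only on every other iteration, i.e. once per 128 bytes:

if ((i & 4) == 0) {   /* i advances by 4 __m128s = 64 bytes per iteration */
    _mm_prefetch((const char*)(A + i + PF_A_DIST), HINT_A);
    _mm_prefetch((const char*)(B + i + PF_B_DIST), HINT_B);
}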

2 In the case of software prefetch, you can also select some other level of cache, using the temporal hint.

3 There is some indication that even with perfect streaming loads, and despite the presence of "next page prefetchers" in modern Intel hardware, page boundaries are still a barrier to hardware prefetching that can be partially alleviated by software prefetching. Maybe because software prefetch serves as a stronger hint that "Yes, I'm going to read into this page", or because software prefetch works at the virtual address level and necessarily involves the translation machinery, while L2 prefetching works at the physical level.

4 Note that the "units" of the PF_DIST values are sizeof(__m128), i.e., 16 bytes, due to the way I calculated the address. For example, a distance of 64 corresponds to prefetching 64 * 16 = 1024 bytes ahead of the current load.

BeeOnRope
  • Wouldn't it be better to put those instructions at the beginning of the loop, so that it fetches data for the next step at the beginning of the current one? And what is PF_DIST, is it = 4 (for the next iteration)? Also, does interleaving matter? I heard that consecutive prefetch instructions could behave worse, so if you emit one at the beginning and another in the middle of a loop, your chances of not overfilling the prefetch queue are better? – xakepp35 Feb 26 '18 at 19:22
  • @xakepp35 - it doesn't really matter if you put it at the start or at the end: imagine the instruction stream will look like `LPLPLPLPL` or `PLPLPLPLPL` in the "at end" and "at start" cases, respectively - where `L` is the main work you do in your original loop and `P` is the prefetch instructions - so the pattern is identical except for the first letter, and the thing is so long the difference is microscopic (and `PF_DIST` tuning will tend to equalize the approaches in terms of which line arrives when). – BeeOnRope Feb 26 '18 at 19:32
  • @xakepp35: A reasonable guess for PF_DIST would be 64, i.e. a prefetch distance of 1024 bytes, like I suggested in comments on the question. Also try 56 and 80, to see whether a larger or smaller prefetch distance is better (on your specific hardware with that CPU and that ratio of CPU clock to memory bandwidth, and probably with the rest of the system idle). All of those factors affect the optimal prefetch distance, which is why tuning SW prefetch is so brittle, and why good HW prefetch was an essential feature for robust performance (which is why modern CPUs have it.) – Peter Cordes Feb 26 '18 at 19:32
  • `PF_[A|B]_DIST` is explained in the answer: it's the "prefetch ahead" distance and is usually going to be more than 4. You want to prefetch ahead "enough" that the cache line arrives before you need it, but not so far ahead that it gets evicted from cache before you use it. Rather than trying to give some number that will work, I just suggest determining it experimentally: write a script to compile for various PF distances and graph the output: there is usually an obvious plateau of "good values" you can choose from. – BeeOnRope Feb 26 '18 at 19:33
  • @BeeOnRope: interesting idea to rely on the L1D adjacent-line prefetcher. But probably a bad idea for NT prefetch: the resulting prefetches won't have an NT hint (and will pollute L2 and maybe trigger prefetching into L2 for lines that are already in L1D but bypassed L2). And will probably be triggered while the `prefetchnta` is significantly farther ahead, so you might not get as good locality within DRAM pages. Still, worth considering if you're on the edge of a uop throughput or load-port bottleneck. Or if you want it to be less of a disaster on IvyBridge. – Peter Cordes Feb 26 '18 at 19:36
  • Oops, NVM, the spatial prefetcher is in L2, not L1D. So yeah, that could be good for `prefetcht0` or `prefetcht1`. With `t0`, the L1D prefetcher might bring lines from L2 into L1, or maybe you'd alternate L1D hit with L2 hit, and out-of-order execution could hide that. It might get more throughput out of the 10 LFBs overall. – Peter Cordes Feb 26 '18 at 19:47
  • @PeterCordes - BTW I have a hard time believing that "1 prefetch every 43 cycles" on IvyBridge is entirely true as stated: there are plenty of small loops with prefetch in common software (including in library routines like `memcpy`), and that would easily show up as a multi-100% performance regression if it was really that bad. So I suspect there is some subtlety there like: "only if the prefetch actually misses in [cache level X]" or "prefetch is silently dropped (but doesn't block CPU) if issued more than once every 43 cycles" or "only applies if the LFBs are full" etc. – BeeOnRope Feb 26 '18 at 19:48
  • @BeeOnRope: I've seen at least one mention here on SO that removing SW prefetch does give real speedups on IvB specifically, from either Harold or Mysticial IIRC. I agree probably not as bad as 1 per 43 cycles in real cases, though. – Peter Cordes Feb 26 '18 at 19:52
  • @PeterCordes - yeah I am actually not sure how the various prefetchers interact with the various software prefetching instructions, so I was throwing it out there as something to consider if you really get down in the weeds on this. The point about NTA is a good one: it would be stupid if NTA triggered hardware prefetches that effectively ruined the NTA nature of the developer-included prefetch instructions, and I think we can be pretty sure it doesn't do that (and my tests seem to indicate it doesn't) - but the mechanism isn't clear. – BeeOnRope Feb 26 '18 at 19:52
  • @PeterCordes - I have no doubt there is an IvB prefetching bug, but I think it can't be as bad as every prefetch instruction effectively having a latency of 43 cycles (with consecutive PFs in a program considered dependent). That would mean any loop with any prefetch would run at best 1 per 43 cycles! That would cause at least a few giant regressions in some software, so I think it's more subtle. It deserves some testing, but I don't have the hardware. – BeeOnRope Feb 26 '18 at 19:53
  • @Peter Cordes often I have several operations separated into two stages: the first stage is doing gather-like IO, combining memory from different locations with different offsets into contiguous, consistent, aligned __m128 arrays, and the second is doing heavy calcs, fed with these assembled arrays. Is it good to issue a prefetch instruction for the start of the arrays at the end of assembling? Is it a good practice? Is it good to process data in the reverse direction (as the last elements would likely still be in cache at the end of assembling)? – xakepp35 Feb 27 '18 at 03:35
  • It depends on how large your arrays are: if the arrays are comparable or smaller than the size of the LLC, then sure you might get some benefit from doing things backwards since a reasonable portion of the data from the last processing might be in the cache (but this is also complicated by new LLC features that try to detect streaming loads and change the replacement policy of the cache when they are detected). The more general approach to make that work is simply to block up your processing: instead of doing each phase entirely, try to alternate between the ... – BeeOnRope Feb 27 '18 at 03:38
  • ... phases so that you are able to work on already-cached data. That's getting to the level of complexity of a whole new question though. – BeeOnRope Feb 27 '18 at 03:39
  • @xakepp35: 2-pass gather and then compute sounds like a bad idea. The gathering will barely go any faster if you store the results vs. feeding them to computation, so you could already be done your computation by the time you're done gathering. (If you have multiple stages of computation, you could store the gathered data *as well* as run the first stage of computation on it). Gather first might have advantages if gathering in the order you need for compute would have worse spatial locality. But overall you want to maximize computational intensity (more compute for every time you load data) – Peter Cordes Feb 27 '18 at 04:06
  • @Peter Cordes not to overflood with comments and simple questions i have created room: https://chat.stackoverflow.com/rooms/165865/cpu-vectorization – xakepp35 Feb 27 '18 at 05:01
  • @BeeOnRope is there an intuition for why in the [AMD Read Bandwidth Benchmarks](http://sites.utexas.edu/jdm4372/2010/11/) reading two pages in "parallel" helps so much? The only thing I can think of is that it might be a better access pattern to DRAM but maybe there is something else going on. – Noah Feb 27 '21 at 19:41
  • @Noah - did you look at part 4 in that series? It has some suggestions as to why multiple streams might be useful. I believe Dr McCalpin reported elsewhere that multiple streams helps also on Intel because the l2 prefetchers aren't able to generate enough outstanding requests from a single stream (they are limited by 4k boundaries among other things), so multiple streams helps in that way too. – BeeOnRope Feb 27 '21 at 20:22
  • @BeeOnRope [The system agent description of sandybridge in Intel Arch manual](https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf#page=59): "For best performance, populate both channels with equal amounts of memory, preferably the exact same types of DIMMs. In addition, using more ranks for the same amount of memory, results in somewhat better memory bandwidth, since more DRAM pages can be open simultaneously..." might help explain why interleaving 4k pages helps. – Noah Jun 21 '21 at 15:54
  • @Noah - but the mapping from physical address to "RAM address" isn't based on 4k pages: rather it is usually much more fine grained than that. It changes with the hardware, but usually the bits that select the channel are something like bits 6-8 (zero based), so the channels may alternate every 64-256 bytes. So you don't need to have streams in different pages to get the benefit of using both channels at once. I have some links on DRAM address mapping [here](https://github.com/travisdowns/uarch-bench/wiki/Links#dram). – BeeOnRope Jun 21 '21 at 21:16
  • That said, the access pattern definition interacts with RAM in important ways. Dr McCalpin talks about that in a few places. – BeeOnRope Jun 21 '21 at 21:18
  • @BeeOnRope hmm, based on the sandy bridge physical address mapping, it seems that the next 4x pages will all map to the same dimm/rank/bank and the 12/13 bit will vary the column offset on an already open row. This would mean, however, that you should be able to aggressively pessimize the results (on sandy bridge at least) by having the memory region not 4 * PAGE_SIZE aligned, because then you will be essentially ping-ponging open rows. – Noah Jun 21 '21 at 21:30
  • @Noah - where do you see that? I recall the channel interleaving was much more fine grained that that. If it worked like that, throughput for small regions of < 4 pages would be terrible because only 1 channel would usually be lit up, halving throughput. – BeeOnRope Jun 21 '21 at 21:36
  • @BeeOnRope [This link on SnB](http://lackingrhoticity.blogspot.com/2015/05/how-physical-addresses-map-to-rows-and-banks.html). bit 6 is channel interleaving, but bits [7-13] are for column location and bits [0-5] ^ [14-16] for row. So if bits [0-5] and [14-16] stay constant, and the only bits you change are [7-13], hypothetically you should be hitting the same row on up to 4x pages (assuming 4x page alignment), which IIRC saves some work bringing the row down for reading. The thing is I looked back at some old tests on SKL/ICL and saw 1kb/2kb interleaving didn't do so well. So there must be more to it. – Noah Jun 21 '21 at 21:48
  • Basically saying it may have to do with "open row" hit rate (but am skeptical because it seems 4k is kind of special). Not sure of any uncore events that can test that. Another possibility for systems that have multiple page walkers is that it gets some parallelism there? But then 4x wouldn't be expected to perform better than 2x, which is not what was observed in the original link. Also unsure if the test cpus have multiple page-walkers (I know far less about AMD). – Noah Jun 21 '21 at 21:51
  • @Noah - I don't think that's what that post says. It says row numbers are in bits 18-32. Most of the parallelism at DRAM comes from using both channels. A small amount of additional parallelism comes from having more than one bank/rank (bits 14-17 in the post you linked) active, although I think this depends on the timing. E.g., if you read all the columns in a row back-to-back (ignoring channels, that happens by default for linear access with this mapping since the column is in bits 0-5, 7-13) that's already pretty efficient since you have almost 100% open row hit rate. – BeeOnRope Jun 21 '21 at 21:56
  • You only need to open a new row every 2048 bytes or whatever the row size is. Now there is *some* latency in opening the new row, so a slightly more efficient strategy might be to have two rows active (in different rank/banks of course!), offset by say 1024 bytes, so that while one bank/rank is changing its open row, you are still going full steam on the other. The memory controller itself reorders stuff heavily, so I don't know if the offset is actually needed. I haven't really played with this stuff yet so I'm mostly guessing. – BeeOnRope Jun 21 '21 at 21:58
  • My impression is that DRAM mapping and fine-grained memory controller behavior is much more important for random or semi-random access than linear access, as the latter basically works well by default and can achieve close to the DRAM bandwidth w/o any tricks (e.g., NT stores), although it may have been different on older hardware. – BeeOnRope Jun 21 '21 at 22:00
  • @BeeOnRope I was observing speedup with 2x/4x page interleaving on SKL/ICL (~10-15%) so I think there is some aspect of it that still is relevant. – Noah Jun 22 '21 at 00:53
  • @Noah - right, I'm not suggesting that interleaving access to pages doesn't speed things up, only that such a speedup doesn't obviously come from the way DDR works. The usual explanation is closer to the core, e.g.: (a) having > 1 stream in different pages allows more prefetching to occur (since there's 1 stream per page, IIRC), and/or is able to hide the gaps in prefetching that occur at 4k boundaries, (b) touching multiple pages allows page walking latency to be hidden: although it may mean that the streams should have an offset within a page to accomplish this. – BeeOnRope Jun 22 '21 at 01:46