
This question is specifically aimed at modern x86-64 cache coherent architectures - I appreciate the answer can be different on other CPUs.

If I write to memory, the MESI protocol requires that the cache line is first read into cache, then modified in the cache (the value is written to the cache line which is then marked dirty). In older write-through micro-architectures, this would then trigger the cache line being flushed immediately; under write-back the flush can be delayed for some time, and some write combining can occur under both mechanisms (more likely with write-back). And I know how this interacts with other cores accessing the same cache line of data - cache snooping etc.

My question is: if the store matches precisely the value already in the cache - if not a single bit is flipped - does any Intel micro-architecture notice this and NOT mark the line as dirty, thereby possibly saving the line from having to be taken exclusive, and sparing the write-back memory overhead that would at some point follow?

As I vectorise more of my loops, my vectorised-operations compositional primitives don't explicitly check for values changing, and to do so in the CPU/ALU seems wasteful, but I was wondering if the underlying cache circuitry could do it without explicit coding (e.g. the store micro-op or the cache logic itself). As shared memory bandwidth across multiple cores becomes more of a resource bottleneck, this would seem like an increasingly useful optimisation (e.g. repeated zeroing of the same memory buffer - we don't re-read the values from RAM if they're already in cache, but forcing a write-back of the same values seems wasteful). Write-back caching is itself an acknowledgement of this sort of issue.
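To make the pattern concrete, here's a contrived sketch (the buffer size, element type and use of std::fill are purely illustrative):

    #include <algorithm>
    #include <vector>

    // Contrived example of the pattern: the scratch buffer is re-zeroed on every
    // iteration, but after the first pass most (often all) of these stores write
    // a zero over a zero that is already sitting in the cache line.
    void process_frames(std::vector<double>& scratch, int frames) {
        for (int f = 0; f < frames; ++f) {
            std::fill(scratch.begin(), scratch.end(), 0.0); // zeros over zeros
            // ... scatter a sparse set of results into scratch, then consume it ...
        }
    }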

Can I politely request holding back on "in theory" or "it really doesn't matter" answers - I know how the memory model works, what I'm looking for is hard facts about how writing the same value (as opposed to avoiding a store) will affect the contention for the memory bus on what you may safely assume is a machine running multiple workloads that are nearly always bound by memory bandwidth. On the other hand an explanation of precise reasons why chips don't do this (I'm pessimistically assuming they don't) would be enlightening...

Update: Some answers along the expected lines here https://softwareengineering.stackexchange.com/questions/302705/are-there-cpus-that-perform-this-possible-l1-cache-write-optimization but still an awful lot of speculation along the lines of "it must be hard because it isn't done", and claims that doing this in the main CPU core would be expensive (but I still wonder why it couldn't be part of the actual cache logic itself).

Update (2020): Travis Downs has found evidence of Hardware Store Elimination but only, it seems, for zeros and only where the data misses L1 and L2, and even then, not in all cases. His article is highly recommended as it goes into much more detail.... https://travisdowns.github.io/blog/2020/05/13/intel-zero-opt.html

Update (2021): Travis Downs has now found evidence that this zero store optimisation has recently been disabled in microcode... more detail as ever from the source himself https://travisdowns.github.io/blog/2021/06/17/rip-zero-opt.html

Tim
  • 3
    The answers on https://softwareengineering.stackexchange.com/questions/302705/are-there-cpus-that-perform-this-possible-l1-cache-write-optimization are mostly terrible; the currently accepted one in particular shows a lack of understanding of caches / CPU registers. – Peter Cordes Nov 22 '17 at 00:23

3 Answers

7

Currently no implementation of x86 (or any other ISA, as far as I know) supports optimizing silent stores.

There has been academic research on this and there is even a patent on "eliminating silent store invalidation propagation in shared memory cache coherency protocols". (Google '"silent store" cache' if you are interested in more.)

For x86, this would interfere with MONITOR/MWAIT; some users might want the monitoring thread to wake on a silent store (one could avoid invalidation and add a "touched" coherence message). (Currently MONITOR/MWAIT are privileged, but that might change in the future.)

Similarly, such elimination could interfere with some clever uses of transactional memory - for example, if the memory location is used as a guard to avoid explicit loading of other memory locations, or, in an architecture that supports it (as AMD's Advanced Synchronization Facility did), to allow dropping the guarded memory locations from the read set.

(Hardware Lock Elision is a very constrained implementation of silent ABA store elimination. It has the implementation advantage that the check for value consistency is explicitly requested.)

There are also implementation issues in terms of performance impact and design complexity. Silent store elimination would prohibit avoiding read-for-ownership (unless it was only active when the cache line was already present in shared state), though read-for-ownership avoidance is also currently not implemented.

Special handling for silent stores would also complicate the implementation of a memory consistency model (probably especially x86's relatively strong model). It might also increase the frequency of rollbacks on speculation that failed consistency checks. If silent stores were only supported for L1-present lines, the time window would be very small and rollbacks extremely rare; stores to cache lines in L3 or memory might increase the frequency to merely very rare, which might make it a noticeable issue.

Silence at cache line granularity is also less common than silence at the access level, so the number of invalidations avoided would be smaller.

The additional cache bandwidth would also be an issue. Currently Intel uses parity only on L1 caches to avoid the need for read-modify-write on small writes. Requiring every write to have a read in order to detect silent stores would have obvious performance and power implications. (Such reads could be limited to shared cache lines and be performed opportunistically, exploiting cycles without full cache access utilization, but that would still have a power cost.) This also means that the cost would largely disappear if read-modify-write support were already present for L1 ECC (a feature which would please some users).

I am not well-read on silent store elimination, so there are probably other issues (and workarounds).

With much of the low-hanging fruit for performance improvement having been taken, more difficult, less beneficial, and less general optimizations become more attractive. Since silent store optimization becomes more important with higher inter-core communication, and inter-core communication will increase as more cores are utilized to work on a single task, its value seems likely to increase.

  • Thanks for your answer which gives me a lot to investigate further, but I note you imply that "Intel [does not] require every write to have a read" which is very much not my understanding. Except for non-cacheable memory and non-temporal writes (both of which would exclude such stuff) every write requires the value to be in cache, and so forces a read if the cache line is not already present. – Tim Nov 21 '17 at 17:41
  • 1
    @Tim Read-for-ownership avoidance is a similarly academic proposal. Among other things, it requires tracking validity/dirtiness at a finer granularity. Given that tag ECC is less common than data ECC ("oh dear, we would have to spend a few more bits on tags!"), supporting finer granularity validity (which also increases coherence complexity) is not a quickly adopted optimization. –  Nov 21 '17 at 17:48
  • @Tim - my understanding of what Paul was saying there was specifically that Intel doesn't require a read from the L1 cache into the core/store buffer to implement a write: the bytes can simply be stored into the L1 (when the line is present) without a read. ECC is mentioned because normally a read would be needed if the L1 were ECC protected, since you need the values adjacent to the store to recalculate the error correcting code. Paul suggests that Intel instead uses a simpler error checking mechanism (parity) that can be updated without needing the adjacent bytes. – BeeOnRope Nov 23 '17 at 17:34
  • Everything you said about "writes implying reads" is correct - but you are talking about the path from L1 to L2 and higher levels of the cache hierarchy and memory, which is different than what Paul was talking about. – BeeOnRope Nov 23 '17 at 17:34
  • @BeeOnRope ah.. that makes sense, thanks for clarifying where I failed to see the subtleties of the point being made by Paul (subtle to me in my ignorance, that is) – Tim Nov 24 '17 at 20:57
  • 1
    @PaulA.Clayton, RFO avoidance does not require partial line tagging if done at full line granularity. With AVX512 this is a very likely use-case (but consecutive smaller stores may also be merged without breaking ordering). It's worth noting that this also does not allow you to avoid coherency-related flows (snoops and such), only the data fetch. Whether this really happens or not is a different question, but one that is not too hard to check. – Leeor Feb 25 '18 at 19:23
  • The IBM z990 processor does implement silent store elimination according to [The IBM eServer z990 microprocessor](https://ieeexplore.ieee.org/abstract/document/5388894). Although the main purpose is not to reduce the need for marking lines dirty or sending ownership coherence requests. Instead, it's employed to reduce L2 cache bandwidth consumption. In the z990, the L2 is shared and write-back while the L1 is private and write-through, and there can be up to 16 processors sharing the L2. With that many write-through L1s, a lot of write traffic can be generated to the L2... – Hadi Brais Jul 25 '20 at 01:40
  • ...Silent stores are eliminated by reading the old value from the L1, storing the new value in the L1, and storing both values in a local buffer between the L1 and L2. The store is only sent to the L2 when the two values differ, otherwise it's discarded. That said, it may also improve performance by not marking lines dirty in the L2 and eventually unnecessarily writing them back to memory. Silent store elimination has benefits beyond bandwidth and coherence. For example, Intel probably eliminates silent stores at the memory controller level to improve write endurance... – Hadi Brais Jul 25 '20 at 01:40
  • ...of the non-volatile Optane memory modules. The same applies to SSDs, but at the level of SSD controllers. Other benefits include reducing false memory ordering violations at the MOB unit and reducing hardware resource consumption in general anywhere in the system after the point of store elimination. I mostly agree with the points made in your answer, but not with the order in which they appear. The points about design complexity and memory consistency should appear first; the others are less important... – Hadi Brais Jul 25 '20 at 01:41
  • ...I'm not sure, though, whether silent store elimination violates the x86 memory consistency model (if you have an example in mind, it'd be useful to add it), but I think it poses consistency issues in other architectures. Anyway, this is a big topic. – Hadi Brais Jul 25 '20 at 01:41
6

I find evidence that some modern x86 CPUs from Intel, including Skylake and Ice Lake client chips, can optimize redundant (silent) stores in at least one specific case:

  • An all zero cache line is overwritten fully or partially with more zeros.

That is, a "zeros over zeros" scenario.

For example, this chart shows the performance (the circles, measured on the left axis) and relevant performance counters for a scenario where a region of varying size is filled with 32-bit values of either zero or one, on Ice Lake:

(Chart: Ice Lake fill performance)

Once the region no longer fits in the L2 cache, there is a clear advantage for writing zeros: the fill throughput is almost 1.5x higher. In the case of zeros, we also see that the evictions from L2 are almost all "silent", indicating that no dirty data needed to be written out, while in the other case all evictions are non-silent.
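As a rough illustration, the kind of test behind this chart looks something like the sketch below (a simplification, not the actual harness - the real code is the zero-fill-bench repo linked in the comments; the function name and buffer size are made up, and `__attribute__((noinline))` assumes GCC/Clang). The fill value is passed in at runtime to a non-inlined function so the identical machine code runs for the zero and non-zero cases:

    #include <chrono>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Fill 'buf' with 'val'. Keeping this non-inline and passing the value at
    // runtime means any performance difference comes from the stored value,
    // not from the compiler specializing the code path (e.g. into memset).
    __attribute__((noinline))
    void fill32(uint32_t* buf, size_t n, uint32_t val) {
        for (size_t i = 0; i < n; ++i) buf[i] = val;
    }

    int main() {
        const size_t n = 64 * 1024 * 1024 / sizeof(uint32_t);  // well beyond L2
        std::vector<uint32_t> buf(n);
        for (uint32_t val : {0u, 1u}) {
            auto t0 = std::chrono::steady_clock::now();
            for (int rep = 0; rep < 10; ++rep) fill32(buf.data(), n, val);
            auto t1 = std::chrono::steady_clock::now();
            std::printf("val=%u: %.1f ms\n", val,
                        std::chrono::duration<double, std::milli>(t1 - t0).count());
        }
    }

To see where the write-backs go (rather than just the throughput difference), you would run something like this while watching the L2 eviction counters (silent vs. non-silent lines out), which is what the chart above plots.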

Some miscellaneous details about this optimization:

  • It optimizes the write-back of the dirty cache line, not the RFO which still needs to occur (indeed, the read is probably needed to decide that the optimization can be applied).
  • It seems to occur around the L2 or the L2 <-> L3 interface. That is, I don't find evidence of this optimization for working sets that fit in L1 or L2.
  • Because the optimization takes effect at some point outside the innermost layer of the cache hierarchy, it is not necessary to write only zeros to take advantage of it: it is enough that the line contains all zeros at the point it is written back to the L3. So, starting with an all-zero line, you can do any amount of non-zero writes, followed by a final zero-write of the entire line1, as long as the line does not escape to the L3 in the meantime.
  • The optimization has varying performance effects: sometimes the relevant perf counters show the optimization occurring, but there is almost no throughput increase; other times the impact can be very large.
  • I don't find evidence of the effect in Skylake server or earlier Intel chips.

I wrote this up in more detail at https://travisdowns.github.io/blog/2020/05/13/intel-zero-opt.html, and there is an addendum for Ice Lake, which exhibits this effect more strongly, at https://travisdowns.github.io/blog/2020/05/18/icelake-zero-opt.html.

Update, June 2021: This optimization has been disabled in the newest CPU microcode versions provided by Intel, for security reasons (details: https://travisdowns.github.io/blog/2021/06/17/rip-zero-opt.html).


1 Or, at least overwrite the non-zero parts of the line with zeros.

BeeOnRope
  • Was this hand-written asm, to avoid the asm differences from [Why is std::fill(0) slower than std::fill(1)?](https://stackoverflow.com/q/42558907), where GCC recognizes 0-fill as memset but dword 1-fill only as normal auto-vectorization? – Peter Cordes Jul 28 '20 at 22:17
  • Oh right, you wrote this up earlier. IIRC I looked at the time for that possible problem, and I think you avoided it, but I don't remember how. – Peter Cordes Jul 28 '20 at 22:26
  • 1
    @PeterCordes - well I implemented it several different ways, but for the diagram shown here and most of the other results, I just ensured that the exact same function was used regardless of the fill value: i.e., the fill value is passed as an argument to a non-inlined function, so I can be sure the same code (literally, as in the same bytes in the .text section) is executing for both tests, with only register contents varying. See eg [here](https://github.com/travisdowns/zero-fill-bench/blob/master/algos.cpp#L34). – BeeOnRope Jul 28 '20 at 22:41
  • This particular case does rely on `HEDLEY_NEVER_INLINE` (but I checked the assembly), so a safer approach would be separate compilation, the argument passed in from another TU, plus laundering the argument through one of the usual tricks so that it loses its constantness (as a final defence, e.g., in the face of LTO). – BeeOnRope Jul 28 '20 at 22:43
  • Sounds good. Using the same machine code at the same address (by using a runtime-variable arg) is even better, rules out any possible front-end effects, too. – Peter Cordes Jul 28 '20 at 22:53
  • We might be able to figure out when the 0 -> 0 detection happens by storing a non-zero into L1d then writing with 0 again (in small blocks so the non-zeros never leave L1d). If that defeats the hardware optimization, then we can infer that there's some "has always been zero" bit in L1d. Otherwise we can infer that there's some kind of checking in L2 to re-establish whether the line is currently 0. (And I guess compare that to a bit that records the fact that original version from L3 was all-zero, which is kept with the line in L1 and/or L2?) – Peter Cordes Jun 22 '21 at 15:55
  • (And BTW, I assume this hardware optimization is due to inefficient software doing this a lot, e.g. for `std::vector` that can't easily take advantage of `calloc` because C++ sucks.) – Peter Cordes Jun 22 '21 at 15:56
  • 1
    @PeterCordes - yeah I did that test already, some [discussion here](https://twitter.com/trav_downs/status/1262583343171895297). I think it supports the idea that the optimization happens at the L1<->L2 boundary or in the L2. That is, if the non-zero value in your suggested test never escapes the L1, the optimization happens. When it escapes the L1 into the L2, it stops. – BeeOnRope Jun 22 '21 at 16:01
  • *> I assume this hardware optimization...* yeah there are a lot of reasons for zeroing. Consider all the zeroing the kernel or other "safe" memory allocators do too (although some of that happens with rep movsb which doesn't need this optimization, or NT stores which also don't need it). Similarly with memory-safe languages like Java that tend to eagerly init all their arrays to zero, and even init all their heap memory as well (both "init everything to zero and overwrite non-zero in ctor" (requires zero-init memory) and "write every field in ctor even zero" (can use garbage memory safely) are viable). – BeeOnRope Jun 22 '21 at 16:04
  • Oh yes, your answer here already has that conclusion about where it happens. (I had been looking at [the version on software-eng.SE](//softwareengineering.stackexchange.com/a/413223) which has an old version of this that asks the question of where it happens, but commented here without re-reading this answer). And yeah, languages like C++ or Java failing to optimize out zero-init even when the memory is fresh from the kernel and guaranteed zero is exactly what I had in mind, but good point that avoiding the RFO is incompatible with this. I guess also any "safe destructor" that zeros freed objects could create this. – Peter Cordes Jun 22 '21 at 16:32
  • @PeterCordes - oh yeah, I forgot about that and maybe now I'm stuck keeping these in sync forever. Not sure if you saw but this optimization was disabled in the newest microcode :(. – BeeOnRope Jun 22 '21 at 16:59
  • Yeah, I saw that. But only just now actually read your blog post and realized the only reason was timing side-channels, not any kind of correctness problem. That's really disappointing. It wouldn't be quite so bad if there was an MSR that allows OSes to tell the CPU not to worry about really obscure / hard-to-use timing side channels, although to be fair mainstream OSes wouldn't set that by default, and hardly anyone would enable it on their desktops / laptops, so it wouldn't be saving the world much energy or many cycles. – Peter Cordes Jun 22 '21 at 17:50
  • @BeeOnRope re"supports the idea that the optimization happens at the L1<->L2 boundary" another piece of evidence for that is that 0o0 and optimizing out RFO requests appear to be mutually exclusive. – Noah Aug 05 '21 at 04:19
  • @Noah - can you elaborate? – BeeOnRope Aug 05 '21 at 20:28
  • @BeeOnRope [your comment here](https://travisdowns.github.io/blog/2020/05/18/icelake-zero-opt.html#comment-75fafc90-740a-11eb-93fe-079499d9a376) seems to explain `rep stosb` not being able to do 0o0 because it issues an [ItoM](https://community.intel.com/t5/Software-Tuning-Performance/What-is-the-ItoM-IDI-opcode-in-the-performance-monitoring-events/m-p/1253627#M7797) as opposed to a full RFO. If the data from the RFO is where 0o0 takes place then it's indeed in L1 <-> L2 (the LFB, to be more specific). That being said, I guess non-cacheable writes would probably also skip whatever mechanism in L1 \ – Noah Aug 05 '21 at 22:06
    was doing 0o0. Could you test this with a non-unified L2 cache by seeing if 0o0 could still take place if the L2 copy of the line was evicted to the L3 but the L1 copy was still present? What does the fact that 0o0 falls off after region size > L3 tell us about the implementation? That the line has to come into L1 in a specific way, or something about what triggered the eviction? – Noah Aug 05 '21 at 22:12
5

It's possible to implement in hardware, but I don't think anybody does. Doing it for every store would either cost cache-read bandwidth or require an extra read port and make pipelining harder.

You'd build a cache that did a read/compare/write cycle instead of just a write, and could conditionally leave the line in Exclusive state instead of Modified (in MESI terms). Doing it this way (instead of checking while it was still Shared) would still invalidate other copies of the line, but that means there's no interaction with memory-ordering. The (silent) store becomes globally visible while the core has Exclusive ownership of the cache line, same as if it had flipped to Modified and then back to Exclusive by doing a write-back to DRAM.

The read/compare/write has to be done atomically (you can't lose the cache line between the read and the write; if that happened the compare result would be stale). This makes it harder to pipeline data committing to L1D from the store queue.
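Purely as a conceptual sketch of that commit decision - this is just C++ modelling the state transition, not a claim about how real cache hardware is organized:

    #include <array>
    #include <cstddef>
    #include <cstdint>

    enum class Mesi { Modified, Exclusive, Shared, Invalid };

    struct Line {
        std::array<uint8_t, 64> data;  // one 64-byte cache line
        Mesi state;
    };

    // Toy model of committing a store with a read/compare/write cycle: if the
    // new bytes match what the line already holds, the line can stay Exclusive
    // instead of flipping to Modified (and later needing a write-back).
    // Assumes we already own the line (E or M) and that the whole sequence is
    // atomic with respect to losing the line, as discussed above.
    void commit_store(Line& line, size_t offset, const uint8_t* src, size_t len) {
        bool changed = false;
        for (size_t i = 0; i < len; ++i) {
            if (line.data[offset + i] != src[i]) {
                line.data[offset + i] = src[i];
                changed = true;
            }
        }
        if (changed) line.state = Mesi::Modified;  // otherwise leave E (or M) as-is
    }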


In a multi-threaded program, it can be worth doing this as an optimization in software for shared variables only.

Avoiding invalidating everyone else's cache can make it worth converting

shared = x;

into

if(shared != x)
    shared = x;

I'm not sure if there are memory-ordering implications here. Obviously if the shared = x never happens, there's no release-sequence, so you only have acquire semantics instead of release. But if the value you're storing is often what's already there, any use of it for ordering other things will have ABA problems.
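In C++ atomics terms, the transformation looks something like this sketch (with exactly the ordering caveat described above):

    #include <atomic>

    std::atomic<int> shared;

    // Skip the store if the value is already there, to avoid dirtying the line
    // and invalidating other cores' copies. Caveat: when the store is skipped
    // there is no release store at all, so other threads can't rely on this
    // call for release/acquire ordering of earlier writes.
    void publish(int x) {
        if (shared.load(std::memory_order_relaxed) != x)
            shared.store(x, std::memory_order_release);
    }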

IIRC, Herb Sutter mentions this potential optimization in part 1 or 2 of his atomic<> Weapons: The C++ Memory Model and Modern Hardware talk (a couple of hours of video).

This is of course too expensive to do in software for anything other than shared variables where the cost of writing them is many cycles of delay in other threads (cache misses and memory-order mis-speculation machine clears: What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings?)


Related: See https://stackoverflow.com/questions/43343231/enhanced-rep-movsb-for-memcpy/43574756#43574756 for more about x86 memory bandwidth in general, especially the NT vs. non-NT store stuff, and "latency bound platforms" for why single-threaded memory bandwidth on many-core Xeons is lower than on a quad-core, even though aggregate bandwidth from multiple cores is higher.

Peter Cordes
  • Thanks Peter, I've seen it explicitly done as you note to avoid invalidating other caches. I was wondering more if the SRAM cells in cache could have write-read logic (https://en.wikipedia.org/wiki/Static_random-access_memory#SRAM_operation) such that the drivers doing a write could detect whether the write changed the state (comparable to a read), and a simple OR-ing of these values with the current dirty flag would then be written to the dirty bit, potentially avoiding a write-back. As you note, this would seem to be too expensive to be worthwhile... so far... – Tim Nov 23 '17 at 10:28
  • 1
    @Tim: Yeah, I thought that's what you were asking. It seems like it would be easy and good until you remember that caches are pipelined and support 1 write per clock. In modern Intel CPUs, there's no perf penalty for unaligned writes (including 32B AVX vectors) as long as they don't cross a cache line boundary, so any multi-cycle operation gets messy with overlap from subsequent stores. (some algos, like https://stackoverflow.com/questions/36932240/avx2-what-is-the-most-efficient-way-to-pack-left-based-on-a-mask, depend on efficient overlapped stores.) – Peter Cordes Nov 23 '17 at 10:38
  • If overlapped stores were the only problem, you could build smarter merging into the store queue (Memory Order Buffer), but even that might interfere with atomicity / memory ordering. I'm really glad Paul Clayton answered this question, too; he pointed out some stuff I hadn't thought of, and generally knows a lot about CPU architecture stuff outside of x86. I didn't know the "silent store" name for this in the literature, because I don't follow CPU architecture academic research or discussion forums regularly. (Just x86 internal / optimization stuff.) – Peter Cordes Nov 23 '17 at 10:42
  • There's no need to invalidate other caches, you can assume your silent store magically went to each core with a shared copy and changed them all to the "new" value without breaking coherence. – Leeor Nov 23 '17 at 14:45
  • 1
    The optimization of conditionally writing at the software level could still make a lot of sense even in the absence of multithreading: imagine a memcpy where with very high probability the destination is already the same as the source (for most cache lines). If you implemented this to check equality first, you'd remove the store traffic entirely for lines that were equal. For largish vectorized copies the memory traffic tends to be the dominating factor so this would help versus a normal copy (it is incompatible with NT stores, however). – BeeOnRope Nov 23 '17 at 17:40
  • 1
    @Leeor: You could do that, but if the compare result is not-equal, you have to re-schedule for commit later when you do own the line. If you already have the line in `E` state, you could switch it to the `M` state or not depending on the compare result, but the store can be committed either way. So it's a much less intrusive design change (but much less powerful an optimization). – Peter Cordes Nov 23 '17 at 21:31
  • @BeeOnRope yes, that's what I'm thinking - I work on vectorising a large library of primitive routines, so cannot judge when CPU-executed comparisons are worthwhile but many operations deferred to primitives (such as "(zero) fill" and "clip numbers to bounds") could frequently reduce their memory bandwidth demands by up to 50% if silent store could be implemented without harming performance elsewhere (a big if, of course). – Tim Nov 24 '17 at 21:06
  • 1
    @Tim well usually it's only a reduction in memory bandwidth by 33%. You are going from 2 reads (1 src, 1 dest for RFO) and 1 write (dest) to 2 reads (1 src, 1 dest for RFO). Keep in mind that if your arrays are large you should look at NT stores which gets the same reduction in a different way (1 read for src, 1 write for dest) and may be faster (since it seems on some chips that total bandwidth is higher with some NT stores in the mix). – BeeOnRope Nov 24 '17 at 21:23
  • 1
    @Tim: There's a long SO answer with lots more details on NT vs. non-NT stores and related memory bandwidth stuff: https://stackoverflow.com/questions/43343231/enhanced-rep-movsb-for-memcpy/43574756#43574756 – Peter Cordes Nov 24 '17 at 22:13
  • @BeeOnRope Thanks, but I'm working on utility routines in a library, vectors may be small, or may be large. Values may be needed again soon (or already in cache) or not. For a "clip this vector to between lower and upper bounds" I can do selective "save if changed" serially (but with 2 branches), or vectorised without branching but always write-back the value (perhaps with NT stores). There's no single right or wrong way for all cases (and it's not that critical) as it depends on how many values are already in bounds etc, but just made me wonder about the trade-off of branch vs store. – Tim Nov 27 '17 at 10:59
  • @Tim: yeah, the right choice depends on the surrounding code, so generic library implementations have to compromise. You might make each function branch on the problem size to use NT store or not (maybe with the size threshold in a global variable, because the threshold will depend on the hardware and the surrounding code. That also makes it easy for the caller to disable NT stores, by setting the threshold very high). Or provide `_nt` versions of some common functions. – Peter Cordes Nov 27 '17 at 11:06
  • 1
    @Tim: It's probably not a good idea to selectively save-if-changed even in an in-place clamp, unless you have versions with/without that behaviour. It probably loses hard if you branch-miss a lot, but the worst-case slowdown from the memory bandwidth of not using selective stores is a smaller percentage. Also, the extra compare and branching can bottleneck on the front-end with data hot in L1D cache, even with perfect branch prediction. (Or at least use up extra space in the ROB / PRF, limiting your out-of-order window.) – Peter Cordes Nov 27 '17 at 11:08
  • That was my thinking re: branching @PeterCordes - thanks for your input. It's easy to draw up a misleading benchmark for this sort of stuff. The size-for-NT threshold I've done for some routines but (so far) it's hit so rarely (and we have SO many calls for small vectors ~50-500) that even that branch is measurably worth removing. – Tim Nov 27 '17 at 12:17
  • @Tim: Do you have the branch-for-NT inside the loop or something? I'm surprised `cmp ecx, [nt_thresh]` / `ja nt_version` near the start of a function has measurable overhead with a size of 50 or more. It's a single uop, and should predict well as fall-through (unless it aliases with some other branches...) Oh, is that 50 elements, so only 50/8 YMM vectors or something? If small use-cases are common, or rewrite-in-place are very common, then even branching to check for NT is probably not a good idea, sure. But I'm surprised it's actually measurable. – Peter Cordes Nov 27 '17 at 12:21
  • 1
    @PeterCordes Yep, 50 doubles or smaller - when you provide simple functions people tend to get used to calling them even for small vectors. We have specialised classes for 2x2 and 3x3 matrices & some small vector optimisations, but quant libraries seem to have lots of small operations (see our previous conversations about why AVX hurts our overall performance in practice). If you haven't seen it, Carl Cook's talk at CppCon (https://github.com/CppCon/CppCon2017 + youtube) may be of interest eg disable all CPU cores but one on a Xeon to get the entire L3 cache dedicated to the single thread – Tim Nov 27 '17 at 12:36
  • 1
    The best option for functions that have a range of inputs, including many very small inputs, is often to unroll the first few iterations at the start of the function: this lets you apply "small data" optimizations and sometimes helps branch prediction a lot by "grouping" various input sizes into the same branching behavior. Once you get through that section, you know you already have at least a medium size input and then you do your "is it big" check - that makes the "is it big" check more or less free since you aren't doing it for the small cases at all. – BeeOnRope Nov 27 '17 at 18:39
  • 1
    In a library, where you don't know the "shape" of the data (e.g., some inputs may almost never clip, others randomly, others almost always, etc), you can always try an adaptive algorithm, e.g., start out in a default loop and if you find that one of the more extreme cases holds, move to a loop that handles that better, etc. Ideally the "tracking" of what case you are in falls out of the algorithm naturally and is free or nearly so, but that's not always the case so you can have instrumented and non-instrumented versions of the loop, etc. – BeeOnRope Nov 27 '17 at 18:42