
Context

The function BN_consttime_swap in OpenSSL is a thing of beauty. In this snippet, condition has been computed as 0 or (BN_ULONG)-1:

#define BN_CONSTTIME_SWAP(ind) \
    do { \
            t = (a->d[ind] ^ b->d[ind]) & condition; \
            a->d[ind] ^= t; \
            b->d[ind] ^= t; \
    } while (0)
…
    BN_CONSTTIME_SWAP(9);
…
    BN_CONSTTIME_SWAP(8);
…
    BN_CONSTTIME_SWAP(7);

The intention is that, so as to ensure that higher-level bignum operations take constant time, this function either swaps two bignums or leaves them in place, in constant time either way. When it leaves them in place, it still reads each word of each bignum, computes a new word identical to the old word, and writes that result back to the original location.

The intention is that this will take the same time as if the bignums had actually been swapped.

In this question, I assume a modern, widespread architecture such as those described by Agner Fog in his optimization manuals. Straightforward translation of the C code to assembly (without the C compiler undoing the efforts of the programmer) is also assumed.

Question

I am trying to understand whether the construct above qualifies as a “best effort” sort of constant-time execution, or as perfect constant-time execution.

In particular, I am concerned about the scenario where bignum a is already in the L1 data cache when the function BN_consttime_swap is called, and the code just after the function returns starts working on the bignum a right away. On a modern processor, enough instructions can be in flight at the same time for the copy not to be technically finished when the bignum a is used. The mechanism allowing the instructions after the call to BN_consttime_swap to work on a is memory dependence speculation. Let us assume naive memory dependence speculation for the sake of the argument.

What the question seems to boil down to is this:

When the processor finally detects that the code after BN_consttime_swap read from memory that had, contrary to speculation, been written to inside the function, does it cancel the speculative execution as soon as it detects that the address has been written to, or does it allow itself to keep the speculative results when it detects that the value written is the same as the value that was already there?

In the first case, BN_consttime_swap looks like it implements perfect constant-time. In the second case, it is only best-effort constant-time: if the bignums were not swapped, execution of the code that comes after the call to BN_consttime_swap will be measurably faster than if they had been swapped.

Even in the second case, this is something that looks like it could be fixed for the foreseeable future (as long as processors remain naive enough) by, for each word of each of the two bignums, writing a value different from the two possible final values before writing either the old value again or the new value. The volatile type qualifier may need to be involved at some point to prevent an ordinary compiler from over-optimizing the sequence, but it still sounds possible.

NOTE: I know about store forwarding, but store forwarding is only a shortcut. It does not prevent a read from being executed before the write it is supposed to come after. And in some circumstances it fails, although one would not expect it to in this case.

Pascal Cuoq
  • Related: `BN_consttime_swap` was added in March 2014 in response to [Recovering OpenSSL ECDSA Nonces Using the FLUSH+RELOAD Cache Side-channel Attack](http://eprint.iacr.org/2014/140). Here's the [commit with comments](http://git.openssl.org/gitweb/?p=openssl.git;a=commitdiff;h=4b7a4ba29cafa432fc4266fe6e59e60bc1c96332). – jww Mar 20 '15 at 04:24
  • Related: https://stackoverflow.com/questions/27865974/is-masking-effective-for-thwarting-side-channel-attacks – Nayuki Mar 25 '15 at 21:14
  • @NayukiMinase Hi! Do you contribute to OpenSSL? I know there are not many way to write these, but the style seems very familiar. If you have already contributed to OpenSSL, I would be happy to review time-constantness or other aspects in exchange for you reviewing patches I would like to get in. – Pascal Cuoq Mar 25 '15 at 21:43
  • Is it possible that someone can "explain like im 5" this post? I am thoroughly confused. – masfenix Mar 26 '15 at 03:56
  • @masfenix Everybody knows that memory accesses reveal, through execution time, information about the accessed address. This is one of the effects of cache. The question is whether the execution times of a memory write can depend on the value being written. More context about why the question needs to be asked and why it seems a priori possible that the execution times of a memory write would depend on the value being written have been written up by Robert Graham in a blog post: http://blog.erratasec.com/2015/03/x86-is-high-level-language.html – Pascal Cuoq Mar 26 '15 at 07:05
  • @PascalCuoq: Hi there! Nope I don't contribute to OpenSSL, so I can't help you with patches. In case you'd like to chat about anything else, my contact information is available on the web. – Nayuki Mar 29 '15 at 17:56

3 Answers


Straightforward translation of the C code to assembly (without the C compiler undoing the efforts of the programmer) is also assumed.

I know it's not the thrust of your question, and I know that you know this, but I need to rant for a minute. This does not even qualify as a "best effort" attempt to provide constant-time execution. A compiler is licensed to check the value of condition, and skip the whole thing if condition is zero. Obfuscating the setting of condition makes this less likely to happen, but is no guarantee.

Purportedly "constant-time" code should not be written in C, full stop. Even if it is constant time today, on the compilers that you test, a smarter compiler will come along and defeat you. One of your users will use this compiler before you do, and they will not be aware of the risk to which you have exposed them. There are exactly three ways to achieve constant time that I am aware of: dedicated hardware, assembly, or a DSL that generates machine code plus a proof of constant-time execution.


Rant aside, on to the actual architecture question at hand: assuming a stupidly naive compiler, this code is constant-time on the µarches I am familiar enough with to evaluate the question, and I expect that to be broadly true for one simple reason: power. I expect that checking in a store queue or cache whether a value being stored matches the value already present, and conditionally short-circuiting the store or avoiding dirtying the cache line, would consume more energy on every store than would be saved on the rare occasions when you get to avoid some work. However, I am not a CPU designer, and do not presume to speak on their behalf, so take this with several tablespoons of salt, and please consult one before assuming this to be true.

Stephen Canon
  • Making condition (and possibly other objects involved) volatile would at least give you a lot better chance of getting the desired behavior, since the compiler would no longer be free to make transformations that alter the number of accesses (and thus evaluations of the condition) performed. – R.. GitHub STOP HELPING ICE Mar 19 '15 at 22:23
  • @R.. Completely agree. Assembly may be ideal from a pure security point-of-view, but trade-offs are everywhere and an open-source library may only afford a clean, well-written C version with `volatile`, which should keep it secure for the next 15 years. I am filing this idea for grouped discussion of all the constant-time constructs in the library. – Pascal Cuoq Mar 19 '15 at 23:13
  • I don't think asm is even a solution. Nothing says the cpu is not free to perform additional optimizations. Certainly a dynrec-based virtual cpu could do this; in the future real cpus may too. IMO the **only** valid approach to constant-time is explicit sleeping/spinning to a desired total time, and thankfully that is perfectly expressible in C. – R.. GitHub STOP HELPING ICE Mar 19 '15 at 23:22
  • ASM is only a solution for known microarchitectures, where you can guarantee that no such optimization occurs. Sleeping/spinning isn't viable, because it has a distinct power signature even if it gets you constant time. Ultimately, dedicated hardware is probably the only real solution. – Stephen Canon Mar 20 '15 at 01:22

This blog post, and the comments made by its author, Henry, on the subject of this question, should be considered as authoritative as anyone should be allowed to expect. I will reproduce the latter here for archival:

I didn’t think the case of overwriting a memory location with the same value had a practical use. I think the answer is that in current processors, the value of the store is irrelevant, only the address is important.

Out here in academia, I’ve heard of two approaches to doing memory disambiguation: Address-based, or value-based. As far as I know, current processors all do address-based disambiguation.

I think the current microbenchmark has some evidence that the value isn’t relevant. Many of the cases involve repeatedly storing the same value into the same location (particularly those with offset = 0). These were not abnormally fast.

Address-based schemes use a store queue and a load queue to track outstanding memory operations. Loads check the store queue for an address match (should this load do store-to-load forwarding instead of reading from cache?), while stores check the load queue (did this store clobber the location of a later load that was allowed to execute early?). These checks are based entirely on addresses (whether a store and a load collided). One advantage of this scheme is that it is a fairly straightforward extension on top of store-to-load forwarding, since the store queue search is also used there.

Value-based schemes get rid of the associative search (i.e., faster, lower power, etc.), but require a better predictor to do store-to-load forwarding (now you have to guess whether and where to forward, rather than searching the SQ). These schemes check for ordering violations (and incorrect forwarding) by re-executing loads at commit time and checking whether their values are correct. In these schemes, if a conflicting store (or some other mistake) still resulted in the correct result value, it would not be detected as an ordering violation.

Could future processors move to value-based schemes? I suspect they might. They were proposed in the mid-2000s(?) to reduce the complexity of the memory execution hardware.

Pascal Cuoq

The idea behind a constant-time implementation is not to actually perform everything in constant time; that will never happen on an out-of-order architecture. The requirement is that no secret information can be revealed by timing analysis. To prevent this, there are basically two requirements:

a) Do not use anything secret as a stop condition for a loop, or as a predicate to a branch. Failing to do so opens you up to a branch-prediction attack: https://eprint.iacr.org/2006/351.pdf

b) Do not use anything secret as an index for a memory access. This leads to cache-timing attacks: http://www.daemonology.net/papers/htt.pdf

As for your code: assuming that your secret is "condition", and possibly the contents of a and b, the code is perfectly constant-time in the sense that its execution does not depend on the actual contents of a, b and condition. Of course the locality of a and b in memory will affect the execution time of the loop, but not their CONTENTS, which are secret. That is assuming, of course, that condition was computed in a constant-time manner. As for C optimizations: the compiler can only optimize code based on information it knows. If "condition" is truly secret, the compiler should not be able to discern its contents and optimize. If it can be deduced from your code, then the compiler will most likely make an optimization for the 0 case.

Vlad Krasnov
  • You are not answering the question. You are giving an axiomatic definition of a notion of secret-time-independence (that forgets division) and you are claiming that the code in the question is secret-time-independent according to the axioms. Please read the question carefully: the question is whether the execution time of a memory write can measurably differ when the value being written is the same as the value already in memory. This is plausible (for reasons highlighted in the question) and would mean that your axiomatic definition is wrong for a second reason. – Pascal Cuoq Apr 03 '15 at 17:45
  • “As for C optimizations: the compiler can only optimize code based on information it knows” is naive. The compiler knows that the macro `BN_CONSTTIME_SWAP` does not have any functional effects when `condition` is 0 **because the source code says so** and is free to compile the entire macro as if it was written `if (condition) …`. The question explicitly asks to ignore this, because this issue is orthogonal to the issue that the question is about. And lastly, it is not “my” code. It is from OpenSSL. – Pascal Cuoq Apr 03 '15 at 18:02
  • I am not aware of any CPU that performs an actual check if the memory contents changed after an instruction executed. Certainly Intel CPUs do not perform such a check, and I think never will, because it is a waste of hardware. The point being: if an operation was executed on the memory already read, the CPU will use the value after the execution, regardless if the value changed or not. – Vlad Krasnov Apr 07 '15 at 09:13