A naive implementation of memcmp()
goes something like this (from this answer):
int memcmp_test(const char *cs_in, const char *ct_in, size_t n)
{
    size_t i;
    const unsigned char *cs = (const unsigned char *) cs_in;
    const unsigned char *ct = (const unsigned char *) ct_in;
    for (i = 0; i < n; i++, cs++, ct++)
    {
        if (*cs < *ct)
        {
            return -1;
        }
        else if (*cs > *ct)
        {
            return 1;
        }
    }
    return 0;
}
Here the traversal of the blocks stops as soon as the first mismatching byte is found. This is no good for cryptographic applications, because it makes the execution time depend on the blocks' contents, which can enable timing attacks. So OpenSSL uses this instead (taken from here):
int CRYPTO_memcmp(const void *in_a, const void *in_b, size_t len)
{
    size_t i;
    const unsigned char *a = in_a;
    const unsigned char *b = in_b;
    unsigned char x = 0;

    for (i = 0; i < len; i++)
        x |= a[i] ^ b[i];
    return x;
}
There are no break or return statements in the middle, so this code has to traverse the entire length of both blocks. At least that is the intent.
Now here's one usage example (from here):
static int des_ede3_unwrap(EVP_CIPHER_CTX *ctx,
                           unsigned char *out, const unsigned char *in, size_t inl)
{
    unsigned char icv[8], iv[8], sha1tmp[SHA_DIGEST_LENGTH];
    int rv = -1;
    //whatever, unrelated then...
    if (!CRYPTO_memcmp(sha1tmp, icv, 8))
        rv = inl - 16;
    //whatever, unrelated
}
Now with link-time code generation (Visual C++ LTCG) or link-time optimization (gcc LTO), the compiler can see both the CRYPTO_memcmp()
implementation and the call site (even if they are in different translation units). It can see that the call site does not use the actual value; it only compares it against zero. So it is free to transform CRYPTO_memcmp()
so that it returns immediately once it finds the first mismatching pair of bytes, and the "secure" version of memcmp()
is no longer secure.
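To make the concern concrete, here is a sketch of what such a transformation could legally produce, assuming the compiler has proved that the caller only ever tests the result against zero (the function name and the returned constant are illustrative, not anything a compiler would emit literally):

```c
#include <stddef.h>

/* Hypothetical result of the optimization described above: since
 * the observed contract is only "zero iff equal", the compiler may
 * rewrite the loop to exit at the first mismatch, reintroducing
 * the data-dependent timing the original code tried to avoid. */
int CRYPTO_memcmp_transformed(const void *in_a, const void *in_b, size_t len)
{
    size_t i;
    const unsigned char *a = in_a;
    const unsigned char *b = in_b;

    for (i = 0; i < len; i++)
        if (a[i] != b[i])
            return 1; /* any nonzero value satisfies the caller's test */
    return 0;
}
```

Observationally, this function is indistinguishable from the original at that call site; only its timing differs.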
How can memcmp()
be implemented such that a standard-compliant compiler will not transform it into a version that enables timing attacks?
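For context, one commonly suggested mitigation is to perform the accesses through volatile-qualified lvalues; this is a sketch under the assumption that the compiler will not elide volatile accesses, not a guarantee the C standard makes about timing:

```c
#include <stddef.h>

/* Sketch of a common mitigation: reads of a, b and updates of x go
 * through volatile-qualified lvalues. The compiler must treat each
 * volatile access as observable, so it cannot prove that only the
 * zero/nonzero result matters and cannot legally shortcut the loop.
 * This constrains optimization in practice but is not a standards-
 * level guarantee of constant-time execution. */
int memcmp_ct(const void *in_a, const void *in_b, size_t len)
{
    const volatile unsigned char *a = in_a;
    const volatile unsigned char *b = in_b;
    volatile unsigned char x = 0;
    size_t i;

    for (i = 0; i < len; i++)
        x |= a[i] ^ b[i];
    return x;
}
```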