Earlier this year Intel published a list of instructions that are guaranteed to have no timing dependency on their data operands. (Initially it was suggested that these are constant-time only when DOITM is enabled, but it was later clarified that they are always constant-time, regardless of DOITM.) Out of curiosity I am looking at how closely real-world crypto implementations conform to this list (i.e. use only instructions from it).

It turns out this list has a number of oddities. It has MOVDQU but not MOVUPS, even though the two should be functionally identical. This is not a serious issue: I can simply take the compiler's assembly output and run `sed 's/movups/movdqu/g'` on it before assembling.

A more difficult obstacle is that it does not have (V)SHUFPS, even though it clearly has plenty of other floating-point shuffle instructions such as VPERMILPS/D. SHUFPS is used in BLAKE3.

Is there a known reason this instruction is not included on the constant-time list? What would be a good way to simulate its functionality, using only instructions from this list?

Peter Cordes
user2249675
  • Following this post (https://stackoverflow.com/questions/26983569/implications-of-using-mm-shuffle-ps-on-integer-vector), I think a possible workaround is to use pshufd (shuffles 32-bit elements within a 128-bit lane) plus vpblendd (blends 32-bit elements from two sources). – user2249675 Jul 27 '23 at 16:55
  • On Skylake at least, some instructions have bypass latency that depends on which execution port they were scheduled on (and/or what port the input came from or the output is going to, IIRC from Intel's optimization manual). Like with `andps` between FP math instructions. Makes me wonder if maybe `shufps` on Ice Lake might depend on whether it gets scheduled to port 1 or port 5. That wouldn't be *data* dependent, even though non-constant, so might actually be good for introducing noise to timing attacks. But if there's extra latency, it might only be for forwarding to some instruction types. – Peter Cordes Jul 27 '23 at 17:56

1 Answer

I cannot find an answer to the first question (why it is not on the list), but I have a solution to the second, namely how to work around this instruction. In the BLAKE3 implementation, the problematic line is

#define _mm_shuffle_ps2(a, b, c)                                               \
  (_mm_castps_si128(                                                           \
      _mm_shuffle_ps(_mm_castsi128_ps(a), _mm_castsi128_ps(b), (c))))

A drop-in replacement is

/* Same lane selection as SHUFPS: low two lanes from a, high two from b. */
#define _mm_shuffle_ps2(a, b, c)                                               \
  _mm_blend_epi32(_mm_shuffle_epi32((a), (c)), _mm_shuffle_epi32((b), (c)),    \
                  0b1100)

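This causes GCC to generate VPSHUFD and VPBLENDD, both of which should be constant-time according to Intel. Both shuffles use the same immediate; the blend mask 0b1100 then takes the low two 32-bit lanes from the shuffled a and the high two from the shuffled b, which matches what SHUFPS selects. Note that _mm_blend_epi32 is an AVX2 intrinsic, so this replacement only applies to builds that target AVX2.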
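
To sanity-check that the shuffle-plus-blend combination selects the same lanes as SHUFPS for every immediate, here is a minimal scalar sketch. It models the documented lane selection of SHUFPS, PSHUFD and 128-bit VPBLENDD on plain arrays rather than calling the intrinsics, so it runs on any machine. All function names (shufps_model, pshufd_model, blendd_model, replacement_model, count_mismatches) are hypothetical helpers made up for this illustration. It says nothing about timing, only about functional equivalence.

```c
#include <stdint.h>

/* Scalar model of SHUFPS: the two low result lanes select from a,
   the two high result lanes select from b. */
void shufps_model(const uint32_t a[4], const uint32_t b[4],
                  unsigned imm, uint32_t out[4]) {
    out[0] = a[imm & 3];
    out[1] = a[(imm >> 2) & 3];
    out[2] = b[(imm >> 4) & 3];
    out[3] = b[(imm >> 6) & 3];
}

/* Scalar model of PSHUFD: every result lane selects from the same source. */
void pshufd_model(const uint32_t src[4], unsigned imm, uint32_t out[4]) {
    for (int i = 0; i < 4; i++)
        out[i] = src[(imm >> (2 * i)) & 3];
}

/* Scalar model of 128-bit VPBLENDD: mask bit i set means lane i comes from b. */
void blendd_model(const uint32_t a[4], const uint32_t b[4],
                  unsigned mask, uint32_t out[4]) {
    for (int i = 0; i < 4; i++)
        out[i] = ((mask >> i) & 1) ? b[i] : a[i];
}

/* The replacement from the answer: shuffle both inputs with the same
   immediate, then take lanes 0-1 from a and lanes 2-3 from b (mask 0b1100). */
void replacement_model(const uint32_t a[4], const uint32_t b[4],
                       unsigned imm, uint32_t out[4]) {
    uint32_t sa[4], sb[4];
    pshufd_model(a, imm, sa);
    pshufd_model(b, imm, sb);
    blendd_model(sa, sb, 0xC, out);
}

/* Compare both models over all 256 immediates; returns the number of
   lanes that differ. Distinct lane values make any wrong selection visible. */
int count_mismatches(void) {
    const uint32_t a[4] = {0xA0, 0xA1, 0xA2, 0xA3};
    const uint32_t b[4] = {0xB0, 0xB1, 0xB2, 0xB3};
    int mismatches = 0;
    for (unsigned imm = 0; imm < 256; imm++) {
        uint32_t want[4], got[4];
        shufps_model(a, b, imm, want);
        replacement_model(a, b, imm, got);
        for (int i = 0; i < 4; i++)
            if (want[i] != got[i])
                mismatches++;
    }
    return mismatches;
}
```

Calling count_mismatches() from a small main should report zero differing lanes. The check on real hardware would use the intrinsics themselves, but since the immediate must be a compile-time constant, each value would need its own macro expansion, which is why the sketch models the selection in scalar code instead.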

user2249675