Implementing a timing, cache attack resistant sort in C

Question

Disclaimer: I am well aware implementing your own crypto is a very bad idea. This is part of a master thesis, the code will not be used in practice.

As part of a larger cryptographic algorithm, I need to sort an array of constant length (small, 24 to be precise), without leaking any information on the contents of this array. As far as I know (please correct me if these are not sufficient to prevent timing and cache attacks), this means:

The sort should run in the same amount of cycles in terms of the length of the array, regardless of the particular values of the array
The sort should not branch or access memory depending on the particular values of the array

Do any such implementations exist? If not, are there any good resources on this type of programming?

To be honest, I'm even struggling with the easier subproblem, namely finding the smallest value of an array.

double arr[24]; // some input
double min = DBL_MAX;

int i;
for (i = 0; i < 24; ++i) {
    if (arr[i] < min) {
        min = arr[i];
    }
}

Would adding an else with a dummy assignment be sufficient to make it timing-safe? If so, how do I ensure the compiler (GCC in my case) doesn't undo my hard work? Would this be susceptible to cache attacks?

A sorting network plus a constant-time compare-and-swap is what you want. See [here](http://hoytech.github.io/sorting-networks/) (slide 32) for the idea. Since the size of your problem is small, the cost of the network probably won't be prohibitive. — Michael Foukarakis, Apr 21 '16 at 10:57
If you are really going *that* deep, you need to think about what the compiler actually *makes* from your code - i.e. "think in assembler". Working on "C" level will not be enough, especially with modern, very aggressively optimizing compilers. `min = arr [i]`most probably will work out to one single register assignment (check this ;) ), with no real externally measurable timing impact. Any `else`clause would be unnecessary. In case things in an `if`clause are working out to be more complex, your approach is the right one. — tofro, Apr 21 '16 at 10:58
My intuition tells me that you can't get away with sorting some data and not leak information about its properties through branch timing and cache behavior. Why not inject noise into the system to conceal things instead? Shuffle your array with fisher-yates fed from a CSPRNG, then sort it normally with quicksort. Intuitively this feels sufficient since all timing attacks I've seen this far depend on repeated crypto operations that leak a little bit of information each time. So unless one timing attack can extract the full key used for the shuffle you should be fine. — Art, Apr 21 '16 at 11:37
Actually, my intuition is probably wrong. Injecting noise might still be a good idea, but I think you can sort without branching or touching different cache lines. Working on a proof-of-concept now. — Art, Apr 21 '16 at 12:01

2501 · Accepted Answer · 2016-04-21T11:12:19.757

3

Use a sorting network, a series of comparisons and swaps.

The swap call must not be dependent on the comparison. It must be implemented in a way to execute the same amount of instructions, regardless of the comparison result.

Like this:

void swap( int* a , int* b , bool c )
{
    const int min = c ? b : a;
    const int max = c ? a : b;
    *a = min;
    *b = max;  
}

swap( &array[0] , &array[1] , array[0] > array[1] );

Then find the sorting network and use the swaps. Here is a generator that does that for you: http://pages.ripco.net/~jgamble/nw.html

Example for 4 elements, the numbers are array indices, generated by the above link:

SWAP(0, 1);
SWAP(2, 3);
SWAP(0, 2);
SWAP(1, 3);
SWAP(1, 2);

edited Apr 21 '16 at 11:12

answered Apr 21 '16 at 11:06

2501

25,460
4
47
87

This will have different branch behavior depending on the data being sorted, so I'm pretty sure it will leak information. Timing attacks are less about instructions executed and more about branches missed and cache lines touched. You touch the same cache lines all the time, so that's good, but you still can't get away from the branches. – Art Apr 21 '16 at 11:43
@Art: Could you explain how the branching in 2501's swap could leak information? As far as I can tell the exact same instructions would be executed, except in a different order. 1 branch will be missed for any input, right? – bkjvbx Apr 21 '16 at 11:54
@Art Branches can be removed using tricks: https://stackoverflow.com/questions/227383/how-do-i-programmatically-return-the-max-of-two-integers-without-using-any-compa but in a less portable way. – 2501 Apr 21 '16 at 11:54
@qxwevr I don't know. This is why I don't design my own crypto. But from what I hear from people who do know is that branching is not safe regardless of how many instructions you execute. – Art Apr 21 '16 at 12:04
1

@qxwevr: The problem with any branch on a modern microprocessor is that the branching itself can take different amounts of time, depending on whether the branch predictor [guessed right or not](http://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-an-unsorted-array). I don't know, if this will be a problem here, because I'd expect the ternary statement to be compiled into a conditional move (on x86), so there is no actual branch. Still, I'd rather err on the side of caution, where crypto is involved. – MikeMB Apr 21 '16 at 15:57

Art · Answer 2 · 2016-04-21T14:24:02.957

This is a very dumb bubble sort that actually works and doesn't branch or change memory access behavior depending on input data. Not sure if this can be plugged into another sorting algorithm, they need their compares separate from the swaps, but maybe it's possible, working on that now.

#include <stdint.h>

static void
cmp_and_swap(uint32_t *ap, uint32_t *bp)
{
        uint32_t a = *ap;
        uint32_t b = *bp;
        int64_t c = (int64_t)a - (int64_t)b;
        uint32_t sign = ((uint64_t)c >> 63);
        uint32_t min = a * sign + b * (sign ^ 1);
        uint32_t max = b * sign + a * (sign ^ 1);
        *ap = min;
        *bp = max;
}

void
timing_sort(uint32_t *arr, int n)
{
        int i, j;
        for (i = n - 1; i >= 0; i--) {
                for (j = 0; j < i; j++) {
                        cmp_and_swap(&arr[j], &arr[j + 1]);
                }
        }
}

The cmp_and_swap function compiles to (Apple LLVM version 7.3.0 (clang-703.0.29), compiled with -O3):

_cmp_and_swap:
00000001000009e0        pushq   %rbp
00000001000009e1        movq    %rsp, %rbp
00000001000009e4        movl    (%rdi), %r8d
00000001000009e7        movl    (%rsi), %r9d
00000001000009ea        movq    %r8, %rdx
00000001000009ed        subq    %r9, %rdx
00000001000009f0        shrq    $0x3f, %rdx
00000001000009f4        movl    %edx, %r10d
00000001000009f7        negl    %r10d
00000001000009fa        orl     $-0x2, %edx
00000001000009fd        incl    %edx
00000001000009ff        movl    %r9d, %ecx
0000000100000a02        andl    %edx, %ecx
0000000100000a04        andl    %r8d, %edx
0000000100000a07        movl    %r8d, %eax
0000000100000a0a        andl    %r10d, %eax
0000000100000a0d        addl    %eax, %ecx
0000000100000a0f        andl    %r9d, %r10d
0000000100000a12        addl    %r10d, %edx
0000000100000a15        movl    %ecx, (%rdi)
0000000100000a17        movl    %edx, (%rsi)
0000000100000a19        popq    %rbp
0000000100000a1a        retq
0000000100000a1b        nopl    (%rax,%rax)

Only memory accesses are reading and writing of the array, no branches. The compiler did figure out what the multiplication actually does, quite clever actually, but it didn't use branches for that.

The casts to int64_t are necessary to avoid overflows. I'm pretty sure it can be written cleaner.

As requested, here's a compare function for doubles:

void
cmp_and_swap(double *ap, double *bp)
{
        double a = *ap;
        double b = *bp;
        int sign = signbit(a - b);
        double min = a * sign + b * (sign ^ 1);
        double max = b * sign + a * (sign ^ 1);
        *ap = min;
        *bp = max;
}

Compiled code is branchless and doesn't change memory access pattern depending on input data.

If I'm not mistaken, this is very similar to 2501's answer, with the "tricks" he linked in his comment to avoid branching implemented, correct? Since bubblesort is essentially a specific type of sorting network. Do you know of a similar comparison without branching for floats and/or doubles? — bkjvbx, Apr 21 '16 at 13:48
yes, it's the same idea. Just wanted to dump it in form of code. For floats and doubles it should actually be simpler since extracting the sign bit is easier and we don't need to worry about overflow. I'll update in a sec. — Art, Apr 21 '16 at 14:20
Great and very portable answer. +1 (Hopefully signbit doesn't introduce any braches.) :-) — 2501, Apr 21 '16 at 14:52
@2501 I checked the generated code. It doesn't. Although I don't know about the timing of the instructions. Integer instructions have constant cost, but floating point could definitely behave differently depending on input. — Art, Apr 21 '16 at 16:04
@Art On my machine, signbit (on a double) appears to return 128 rather than 1 for negative input. Would `int sign = a - b < 0` be a valid alternative? I checked the assembly, there's (obviously) a compare but it's used with a set byte on comparison instruction. So that shouldn't count as branching, as far as I can tell. — bkjvbx, Apr 26 '16 at 18:58
@qxwevr oh. of course, that was stupid of me. You want something that checks if a value is 0 or not 0 without branching. `int sign = !!signbit(a - b)`. `!!` should be the most efficient way to normalize (set to 0 or 1) a boolean in an int and compilers should generate branchless code for that, if possible. I suspect that if "!!x" doesn't generate branchless code, then "a - b < 0" won't either. Btw. It's always safer to write "a < b" to avoid overflows and such. I only did it in the int example to not do comparisons which can cause branches and be able to extract the sign bit. — Art, Apr 28 '16 at 06:42

score 0 · Answer 3 · answered Apr 21 '16 at 11:19

A very trivial, time-constant (but also highly in-efficient) sort is to

have a src and destination array
for each element in the (sorted) destination array, iterate through the complete source array to find the element that belongs exactly into this position.

No early breaks, (nearly) constant timing, not depending on even partial sortedness of the source.

Implementing a timing, cache attack resistant sort in C

3 Answers3