
Thank you very much. I am trying to optimize the Kasumi algorithm written in C. Its FI function uses S-boxes to encrypt the data: the S7-box has 128 elements and the S9-box has 512 elements. The FI function code looks like this:

static u16 FI(u16 in, u16 subkey)
{
    static u16 S7[] = {...};
    static u16 S9[] = {...};
    u16 nine, seven;

    nine  = (u16)(in>>7);
    seven = (u16)(in&0x7F);
    /* Now run the various operations */
    nine  = (u16)(S9[nine] ^ seven);
    seven = (u16)(S7[seven] ^ (nine & 0x7F));
    seven ^= (subkey>>9);
    nine  ^= (subkey&0x1FF);
    nine  = (u16)(S9[nine] ^ seven);
    seven = (u16)(S7[seven] ^ (nine & 0x7F));
    in = (u16)((seven<<9) + nine);
    return in;
}

Here u16 is an unsigned short and u8 an unsigned char.
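
For completeness, the snippets below assume roughly these definitions (a minimal sketch so they compile on their own):

#include <immintrin.h>   /* AVX2 intrinsics used below */

typedef unsigned char  u8;
typedef unsigned short u16;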

By some transformation I merged the S7-box and the S9-box into a single S16-box, and I use AVX2 instructions to process 16 values in parallel. The FI function then looks like this:

static __m256i FI(__m256i in, u16 subkey)
{
    u16 arr[16];
    _mm256_store_si256((__m256i*)arr, in);
    u8 i;
    for(i = 0; i < 16; i++)
    {
        arr[i] = (u16)(s16[arr[i]] ^ subkey);
        arr[i] = (arr[i] << 7) | (arr[i] >> 9);
        arr[i] = s16[arr[i]];
    }
    in = _mm256_load_si256((__m256i*)arr);
    return in;
}
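
Roughly, each s16 entry performs one S9/S7 round and stores the result in the (seven<<9)|nine output layout, so that the XOR with the whole subkey and the rotate by 7 above restore the (nine<<7)|seven index layout for the second lookup. A sketch of the generator (untested; it assumes S7/S9 are visible at file scope, and build_s16 is just an illustrative name):

static u16 s16[65536];

static void build_s16(void)
{
    unsigned x;
    for (x = 0; x < 65536; x++) {
        u16 nine  = (u16)(x >> 7);            /* same split as in the scalar FI */
        u16 seven = (u16)(x & 0x7F);
        nine  = (u16)(S9[nine] ^ seven);
        seven = (u16)(S7[seven] ^ (nine & 0x7F));
        s16[x] = (u16)((seven << 9) | nine);  /* FI output layout */
    }
}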

The S16-box has 65536 elements, so some cache misses are likely. I also tried a version using gather instructions:

inline static __m256i FI( __m256i in, u16 subkey )
{
    /* Split the 16 u16 lanes into two sets of 8, each held in the low half
       of a 32-bit lane, so they can be used as gather indices. */
    __m256i _tmp = _mm256_set1_epi32(0xffff);
    __m256i even_sequence = _mm256_and_si256(in, _tmp);
    __m256i odd_sequence = _mm256_srli_epi32(in, 16);
    /* 32-bit gathers from the u16 table: mask off the neighbouring entry that
       lands in the upper half-word. (Note: the gather for index 65535 reads
       2 bytes past the end of s16, so the table should be padded by one entry.) */
    even_sequence = _mm256_i32gather_epi32((int const*)s16, even_sequence, 2);
    __m256i _subkey = _mm256_set1_epi16(subkey);
    even_sequence = _mm256_xor_si256(even_sequence, _subkey);
    even_sequence = _mm256_and_si256(even_sequence, _tmp);
    odd_sequence = _mm256_i32gather_epi32((int const*)s16, odd_sequence, 2);
    odd_sequence = _mm256_xor_si256(odd_sequence, _subkey);
    odd_sequence = _mm256_and_si256(odd_sequence, _tmp);
    // rotate each 16-bit value left by 7
    __m256i hi = _mm256_slli_epi16(even_sequence, 7);
    __m256i lo = _mm256_srli_epi16(even_sequence, 9);
    even_sequence = _mm256_or_si256(hi, lo);
    // same for odd
    hi = _mm256_slli_epi16(odd_sequence, 7);
    lo = _mm256_srli_epi16(odd_sequence, 9);
    odd_sequence = _mm256_or_si256(hi, lo);
    // second table lookup
    even_sequence = _mm256_i32gather_epi32((int const*)s16, even_sequence, 2);
    odd_sequence = _mm256_i32gather_epi32((int const*)s16, odd_sequence, 2);
    // re-interleave the 16-bit results
    even_sequence = _mm256_and_si256(even_sequence, _tmp);
    odd_sequence = _mm256_slli_epi32(odd_sequence, 16);
    in = _mm256_or_si256(even_sequence, odd_sequence);

    return in;
}

But the performance still cannot meet the requirements. I have also thought about bit-slicing: I read a paper that processes 128 blocks in parallel, but it needs some hardware support. I think the bit-transpose operation is time-consuming and there are many constraints.
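
To be concrete, the transpose I mean is the classic shift-and-mask bit-matrix transpose (Hacker's Delight style); the 16x16 case looks roughly like this (an untested sketch, shown only to illustrate the kind of work involved per block):

static void transpose16x16(u16 a[16])
{
    unsigned j, k;
    u16 m = 0x00FF;
    /* halve the block size each pass: 8, 4, 2, 1 */
    for (j = 8; j != 0; j >>= 1, m = (u16)(m ^ (m << j))) {
        for (k = 0; k < 16; k = (k + j + 1) & ~j) {
            u16 t = (u16)((a[k] ^ (a[k + j] >> j)) & m);
            a[k]     = (u16)(a[k] ^ t);
            a[k + j] = (u16)(a[k + j] ^ (t << j));
        }
    }
}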

Thank you very much!

Bai
  • How many bytes/bits are encrypted using the same subkey? (You suggest 16 but Kasumi AFAIK encrypts data in 8-byte chunks?) And how about adding the sbox generator functions? – Aki Suihkonen Aug 16 '17 at 03:50
  • So all three variants are equivalent? Did you profile them? Can you share some numbers: how much they differ, or where most of the time is spent? Or if possible (short enough), add some init code to make the code [MCVE], so one can try it himself; but still, adding some context about your current position (how far you are off the requirements) would be nice. – Ped7g Aug 16 '17 at 10:58
  • @Bai, can you add some numbers of L1 cache misses? – Surt Aug 26 '17 at 19:08
  • Use C11 `_Alignas(32) u16 arr[16];` to make sure the 256b store doesn't fault. (If it didn't already, maybe you're using a compiler that compiles `_mm256_store_si256` to `vmovdqu`.) Some compilers (like gcc) will compile it to `vmovdqa`, so it will fault on unaligned instead of potentially running slower (e.g. for a cache-line split). I guess that's just your reference implementation, not what you're optimizing, but on CPUs with slow gathers it might be better. Do the first 2 elements with a `_mm_cvtsi128_si32` (and unpack with scalar mask/shift) so you get started with lower latency. – Peter Cordes Aug 27 '17 at 14:58

1 Answer


This piece of code, together with your comment below it, might explain the performance problem.

static __m256i FI(__m256i in, u16 subkey) {
    u16 arr[16];
    _mm256_store_si256((__m256i*)arr, in);
    u8 i;
    for(i = 0; i < 16; i++)
    {
        arr[i] = (u16)(s16[arr[i]] ^ subkey);
        arr[i] = (arr[i] << 7) | (arr[i] >> 9);
        arr[i] = s16[arr[i]];
    }
    in = _mm256_load_si256((__m256i*)arr);
    return in;
}

The S16-box has 65536 elements, so some cache misses are likely.

The average x64 processor has only 32KB of L1 data cache (some AMD parts have 64KB, but let's ignore that for now).

This means that with a random access pattern your 64KB table will get a cache hit rate of roughly 32KB/64KB * 100% = 50%, assuming no other data structure uses any L1 and you're not running hyper-threading, where the other thread could also take some L1.

Let's simplify this by saying that only 16KB of the 64KB table actually sits in L1, giving a 75% miss chance for each access. Your loop body has data dependencies between its lines, i.e. the next statement can't start before the previous one is done. Luckily, each iteration is data-independent of the others.

arr[i] = (u16)(s16[arr[i]] ^ subkey);
arr[i] = (arr[i] << 7) | (arr[i] >> 9);
arr[i] = s16[arr[i]];

arr will almost certainly be in L1 cache at this point, incurring only a 4 cycle load-use latency. Each access to s16 will on average cost 0.25*4 + 0.75*12 = 1 + 9 = 10 cycles. This gives the following approximate latency cost for each statement (ignoring the cost of the stores and reloads of arr[i]; assume arr[i] is kept in a register):

arr[i] = (u16)(s16[arr[i]] ^ subkey); // arr: 4 + S16: 10 + ^:1
arr[i] = (arr[i] << 7) | (arr[i] >> 9); // << : 1 + |: 1
arr[i] = s16[arr[i]]; // s16 : 10 + store arr : 4

That is ~31 cycles of latency per iteration; luckily there are no data dependencies between iterations. Each iteration takes approximately 3 cycles to issue, so the last one will finish in roughly 3*16 + 31 = 79 cycles, assuming perfect branch prediction and ignoring the data hazards on the assignment to in at the end.

Your next code, which I presume is this loop rewritten with AVX2 gathers, will have much the same load dependencies and exactly the same cache misses; the loop overhead will be gone, but some longer-latency AVX instructions might increase the time. The average time will still be the ~31 cycles of latency + some AVX latency + 16 loads / (max 2 loads per cycle); let's say 40 cycles.

If you had not merged S7 and S9 they would only take (128+512)*2 = 1280 bytes and would almost certainly always be in the L1 cache when you encrypt anything longer. The loop latency would then fall to roughly half, at the cost of doubling the number of loads, and the full AVX version would come to something like 15 cycles + 32 loads / 2 per cycle; let's say 30 cycles.
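
Just to illustrate, an un-merged variant could keep your even/odd 32-bit-lane layout and gather from S9/S7 directly, so both tables stay hot in L1. This is an untested sketch (FI8 is a name I made up; each of the 8 inputs sits in the low 16 bits of a 32-bit lane with the upper half zero, you would call it once for the even and once for the odd halves, and it assumes S7/S9 are visible at file scope; note the 32-bit gathers read 2 bytes past the last table entry for the top index, so pad S7/S9 by one element):

static inline __m256i FI8(__m256i in, u16 subkey)
{
    const __m256i m7  = _mm256_set1_epi32(0x7F);
    const __m256i m9  = _mm256_set1_epi32(0x1FF);
    const __m256i ki1 = _mm256_set1_epi32(subkey >> 9);    /* 7-bit key half */
    const __m256i ki2 = _mm256_set1_epi32(subkey & 0x1FF); /* 9-bit key half */

    __m256i nine  = _mm256_and_si256(_mm256_srli_epi32(in, 7), m9);
    __m256i seven = _mm256_and_si256(in, m7);

    /* first round: nine = S9[nine]^seven; seven = S7[seven]^(nine&0x7F) */
    nine  = _mm256_xor_si256(_mm256_and_si256(
                _mm256_i32gather_epi32((int const*)S9, nine, 2), m9), seven);
    seven = _mm256_xor_si256(_mm256_and_si256(
                _mm256_i32gather_epi32((int const*)S7, seven, 2), m7),
                _mm256_and_si256(nine, m7));

    /* key mixing */
    seven = _mm256_xor_si256(seven, ki1);
    nine  = _mm256_xor_si256(nine, ki2);

    /* second round */
    nine  = _mm256_xor_si256(_mm256_and_si256(
                _mm256_i32gather_epi32((int const*)S9, nine, 2), m9), seven);
    seven = _mm256_xor_si256(_mm256_and_si256(
                _mm256_i32gather_epi32((int const*)S7, seven, 2), m7),
                _mm256_and_si256(nine, m7));

    /* output layout (seven<<9)|nine, still in the low 16 bits of each lane */
    return _mm256_or_si256(_mm256_slli_epi32(seven, 9), nine);
}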

The good news is that each 16-element iteration seems to be independent of the previous one, so they can overlap in time. But you are ultimately limited by the number of loads and stores: one initial vector load, 32 loads from S7+S9 and one store, which with a maximum of 2 loads or stores per cycle limits the best possible throughput to one 16-element block per (1+32+1)/2 = 17 cycles.

This makes a lot of optimistic assumptions; only real measurements of the two variants (s16 vs. s7+s9) can decide which is best.
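
A minimal way to get such measurements, as a sketch (fi_variant is a placeholder for whatever wrapper you benchmark; take the best of many runs to reduce noise):

#include <stdint.h>
#include <stddef.h>
#include <x86intrin.h>   /* __rdtsc */

/* Placeholder: whichever variant you want to time, applied to a whole buffer. */
extern void fi_variant(uint16_t *buf, size_t n, uint16_t subkey);

static uint64_t time_variant(void (*fn)(uint16_t *, size_t, uint16_t),
                             uint16_t *buf, size_t n, uint16_t subkey)
{
    uint64_t best = UINT64_MAX;
    int r;
    for (r = 0; r < 100; r++) {       /* repeat and keep the fastest run */
        uint64_t t0 = __rdtsc();
        fn(buf, n, subkey);
        uint64_t t1 = __rdtsc();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;                      /* TSC reference cycles for n elements */
}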

Surt
  • Nice analysis of the cache-miss problem. But you're mixing up latency and throughput when you're describing it. Latency isn't a problem when out-of-order execution can hide it. It's not latency directly that's a problem here, because each iteration is independent. I think limited memory concurrency will be the real bottleneck here; for example a Skylake CPU only has 10 line-fill buffers to track outstanding requests for L1D cache lines. So you can't keep nearly as many cache misses in-flight as this code can generate. (See also [Latency Bound](https://stackoverflow.com/a/43574756/224132)) – Peter Cordes Aug 27 '17 at 15:08
  • The 31 cycles of latency will fill a lot of the ROB, lowering the throughput. The limit of 10 outstanding misses will just make it worse. – Surt Aug 27 '17 at 15:29
  • The OP didn't say what hardware, but even Sandybridge has a 168 entry ROB. (And a 160 entry integer PRF.) But yeah, the ROB and/or RS will fill up from cache-miss latency before they could get enough iterations in flight for 2 loads per clock throughput. But remember that's an *average* latency, and when one iteration stalls, the outstanding loads in other iterations can make progress. – Peter Cordes Aug 27 '17 at 15:37
  • Anyway, that many cache misses are going to suck, but a simple analysis assuming latency = average might not tell you exactly which microarchitectural resource you bottleneck on. – Peter Cordes Aug 27 '17 at 15:39
  • Thank you very much. I have tried s7+s9; the miss rate did decrease, but according to my experiment the overall result became worse, maybe because I use AVX instructions in other parts of my code. – Bai Sep 03 '17 at 03:20