Given an array of bytes, how can I create a byte where each bit contains the mode (average) value for bits in that position from the input?

Question

I'm trying to figure out quick and easy way to find the mode ("average") of each bit in an array of bytes.

Here is an example of what I an looking for:

Byte 1  1010 1010
Byte 2  0101 0101
Byte n  1010 1000

Result  1010 1000

So if the bit position contains mostly 1's, the bit position in the answer is 1. If the bit position contains mostly 0's, the answer is 0. If the 1's and 0's are in equal occurrence then I don't care what value gets put in that position in the answer.

The order of magnitude of the number of inputs is small for my use-case (around 10 to 20 inputs), but discussion on performance of your approach as it scales with number of inputs is welcome.

I can manually do a count of each 1 and each 0 and work it out that way, but I was hoping that there would be a more elegant and possibly faster way to do it.

Instead of counting 1s and 0s, you could probably just keep a single counter for each bit position, increment it when you see a 1, decrement it when you see a 0. But I don't think there is a bit-twiddling hack that would allow you to do this without checking individual bits. — vgru, Mar 28 '23 at 07:34
You can alternatively phrase this as: Find a constant *c* such that (b1 ^ c) + (b2 ^ c) + ... + (bn ^ c) is minimized. (when c has this "average" value, it turns off the most bits) — Artyer, Mar 28 '23 at 09:07
@Vernon Have a look at my answer (and the link). For 10 I see unrolling for my solution, which is good. For 20 you have some branches, but none that are data-dependent -> so they are predictable. — bitmask, Mar 28 '23 at 09:52
@Artyer That's a very astute observation. I think each summand should be wrapped in `popcount`, otherwise high bits dominate over low bits because of the way `+` works. — bitmask, Mar 28 '23 at 10:14
For large inputs, use a variation of a Harley-Seal population count. A vertical counter using a [carry-save adder](https://en.wikipedia.org/wiki/Carry-save_adder) network could be [50 times faster](https://arxiv.org/abs/1911.02696#) than a naive solution. However, with small arrays of unknown size it probably doesn't help much at all. — aqrit, Mar 29 '23 at 22:09

score 2 · Answer 1 · edited Mar 29 '23 at 02:30

2

The assumption is that bytes is a vector<uint8_t> or array<uint8_t>.

Keep a table called counts of 8 integers all initialized to 0. Pseudocode:

For each byte B in the input set of bytes
   for each bit b at index i in B
       if b is set:
          counts[i]++

Then use counts to build the final result:

vector<int> counts(8);
uint8_t result = 0;

for (uint8_t b : bytes)
{
    uint8_t mask = 0x80;
    size_t needed = (bytes.size() +1) / 2;

    for (size_t i = 0; i < 8; i++)
    {
        counts[i] += (mask & b) ? 1 : 0;
        if (counts[i] >= needed)
        {
            result = mask | result;
        }
        mask = mask >> 1;
    }
}

std::cout << "Result: " << result << "\n";

edited Mar 29 '23 at 02:30

user16217248

3,119
19
19
37

answered Mar 28 '23 at 07:47

selbie

100,020
15
103
173

`vector counts;count.resize(bytes.size());` ==> `vector counts(8, 0);`. There are always 8 bits in a byte, independently on how many bytes you want to check. As it is fixed size, you could use an `array` for that. – mch Mar 28 '23 at 07:52
@mch - there are indeed 8 bits in a byte. But there up to `n` bytes to be considered for the average. In the above example, `n == bytes.size()` – selbie Mar 28 '23 at 07:58
When `bytes` only has 4 values, it will crash for `i > 3` at `counts[i] +=...` https://godbolt.org/z/hhc9c9xar – mch Mar 28 '23 at 08:01
Doh! you are right. Fixed – selbie Mar 28 '23 at 08:14
If `bytes.size() == 3` as in the OP's example, then `if (counts[i] >= bytes.size()/2)` will exit when `counts[i] >= 1` but it should be exiting when `counts[i] >= 2`. We need to handle the odd `bytes.size()` differently right? – Stephen Quan Mar 28 '23 at 08:19
1

@StephenQuan - I believe you are correct. I'm having a bad night. :) – selbie Mar 28 '23 at 08:23
2

Please do not throw a blob of code at the readers. Explain the main idea behind the algorithm. – j6t Mar 28 '23 at 08:40
This answer is wrong and should be edited or deleted. Already edited 2 times. Still wrong. `result = mask & result;` is wrong. Always 0. And the output of an unsigned char is not an integer. Also no explanation, just an incomplete code dump. No comment in the code. How can such an answer get 2 upvotes? That is more than strange. Either the upvoters are stupid or high rep people get simply upvoted by other high rep people. This has definitely a smell. Flagged as "not an answer" – A M Mar 28 '23 at 09:48
Ok. Fixed (again) – selbie Mar 28 '23 at 20:40

nielsen · Answer 2 · 2023-03-28T09:29:54.433

A solution for both C and C++ (since I did not notice the C++ tag until later):

#include <stdio.h>
#include <stdint.h>

uint8_t bitvote(uint8_t *bytes, unsigned n)
{
    uint8_t bit = 1;
    uint8_t result = 0;
    // Iterate over each bit position
    for(unsigned i = 0; i < 8*sizeof(uint8_t); i++)
    {
        // For the current position, gather a "vote" count from each input
        // byte. A 0-bit counts as -1 and a 1-bit as +1.
        int vote = 0; 
        for(unsigned c = 0; c < n; c++)
        {
            vote += (bytes[c] & bit) ? 1 : -1;
        }
        // If the "vote" is larger than 0, there is a majority of 1-bits, so
        // set the corresponding bit in the result.
        // On equal count of 0's and 1's, the resulting bit is 0. To change
        // that to 1, use "vote >= 0".
        if(vote > 0) result |= bit;
        bit = bit << 1;
    }
    return result;
}

int main()
{
    uint8_t bytes[] = {0xaa, 0x55, 0xa8};  // Test vector of the OP
    uint8_t result = bitvote(bytes, sizeof(bytes)/sizeof(bytes[0]));
    printf("0x%02x", result);   // Prints 0xa8 matching the OP expected result

    return 0;
}

This approach is rather straightforward, but I do not think it is possible to make some smart bit-manipulating moves. A partial argument is the following:

Consider two bits so the possible combinations are 00, 01, 10, 11. This gives 3 possibilities: majority of 0's, equal count and majority of 1's. Hence it is not possible to compress this information into a single bit. So if we process the input bytes one by one, we cannot have intermediate states which are just one byte in size.

Looks like a manual count. If the it is not, please do not throw a blob of code at the readers. Explain the main idea behind the algorithm. — j6t, Mar 28 '23 at 08:42
@j6t Fair point. Admitted, I was in a bit of a hurry with this answer. I have added comments to explain the workings of the code and a justification of the approach which is not exactly what the OP wished for, but I doubt it can be significantly improved using standard C/C++ (at least for moderate input sizes). — nielsen, Mar 28 '23 at 09:33

Stephen Quan · Answer 3 · 2023-03-28T09:04:43.303

If you have a for loop that iterates through 0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01, we can use it as a mask that checks each bit in the data:

    for (unsigned char mask = 0x80; mask; mask >>= 1 ) {
        if (data & mask) {
            // 1-case
        } else {
            // 0-case
        }
    }

We can refactor the if statement by using the conditional, i.e.

 (data & mask) ? /* 1-case */ : /* 0-case */

When counting the bits, we only have to count enough until we can determine the majority for a specific bit before moving to the next bit. We can consider an early exit implementation if it becomes obvious which way the data is leaning. If it's a nail-biting average, then, we have no choice but to iterate through all of the records.

A majority is reached when:

majority == n/2     , n even
majority == (n+1)/2 , n odd

When doing integer math, we can cover both cases with the following statement:

int majority = (n+1) >> 1;

When we have identified a 1-case, then, we can set a bit in the result with:

result |= mask;

Here's a working c++ example:

#include <stdio.h>
#include <bitset>
#include <iostream>

void print_data(unsigned char data) {
    std::cout << std::bitset<8>(data) << '\n';
}

void print_arr(unsigned char*data, int n) {
    for (int i = 0; i < n; i++, data++) {
        std::cout << "Byte " << i << " ";
        print_data(*data);
    }
}

unsigned char mean_arr(unsigned char*data, int n) {
    int majority = (n + 1) >> 1;
    unsigned char result = 0;
    for (unsigned char mask = 0x80; mask; mask >>= 1 ) {
        int count[2] = {0,0};
        unsigned char* ptr = data;
        for (int i = 0; i < n; i++, ptr++) {
            if (++count[*ptr & mask ? 1 : 0] < majority) continue;
            if (*ptr & mask) result |= mask;
            break;
        }
    }
    return result;
}

int main() {
    unsigned char data[3] = { 0xaa, 0x55, 0xa8 };
    print_arr(data, 3);
    unsigned char result = mean_arr(data, 3);
    std::cout << "Result ";
    print_data(result);
    return 0;
}

Which outputs:

Byte 0 10101010
Byte 1 01010101
Byte 2 10101000
Result 10101000

score 1 · Answer 4 · answered Mar 28 '23 at 08:35

For standard C++, there are plenty of answers already.

Somebody in the comments said "But I don't think there is a bit-twiddling hack that would allow you to do this without checking individual bits.", which gave me the push to write this, which is based on Intel intrinsics and is not standard C++.

You can use PDEP to extract the bits into bytes in a U64 (if you have more than 256 elements in your input, adjust the PDEP accordingly to use multiple U64s with increased bit-width for each bit-accumulator "lane").

Wikipedia link for Intel BMI, Intel intrinsics reference, related Stack Overflow question.

You can use std::transform_reduce with std::execution::par_unseq, with the transform being the PDEP expansion, the reduce being the sum operation (assuming you set the bit-width of each lane wide enough based on the number of inputs that overflow is impossible), and then at the end, if a lane's value is over half the number of input bytes, then set the corresponding bit of output to a 1. There's probably a fancy SIMD way to do that part, but the performance impact of that step is a constant factor (and I'm too lazy to find that way right now).

Your answer seems great, but as a reader I would like to see a code example with it. That would make it perfect! — Fra93, Mar 28 '23 at 08:59

Bob__ · Answer 5 · 2023-03-29T09:19:38.420

Mathematically, you can "expand" the bits of an 8-bit byte into a 64-bit unsigned value with a multiplication and some bit manipulations.

Then you can add up to 255 of those "in parallel" by just adding the 64-bit numbers.

The approach, mentioned by user, is shown in this detailed answer by phuclv (along with some examples of implementations using intrinsics):

https://stackoverflow.com/a/51750902/4944425

Using their formula, we can write the following:

// The following is valid only if the size of the span is less than 256,
// otherwise a temporary array must be used to accumulate the results
// every 255 elements.
auto bits_mode(std::span<std::uint8_t> src) -> std::uint8_t
{
  // Multiplying a byte by this 64-bit constant will "copy" it in 8
  // different positions with a gap of 1 bit between them.
  // 
  // E.g. given x = 10100100
  //                                                           10100100
  // * 1000000001000000001000000001000000001000000001000000001000000001
  // = 0010100100010100100010100100010100100010100100010100100010100100
  //   ^ ^^^^^^^^ ^^^^^^^^ ^^^^^^^^ ^^^^^^^^ ^^^^^^^^ ^^^^^^^^ ^^^^^^^^
  constexpr std::uint64_t mult {
    0b10000000'01000000'00100000'00010000'00001000'00000100'00000010'00000001
  };

  // The gaps between the copies is crucial, it let us retrieve the 
  // spread bits (with reversed endianness) with a simple mask:
  //
  // & 1000000010000000100000001000000010000000100000001000000010000000 
  // = 0000000000000000100000000000000000000000100000000000000010000000
  //   ^       ^       ^       ^       ^       ^       ^       ^
  constexpr std::uint64_t mask {
    0b10000000'10000000'10000000'10000000'10000000'10000000'10000000'10000000
  };

  // Shifting the intermediate gives us something we can safely sum up
  // to 255 times.
  //
  // >> 7
  // = 0000000000000000000000010000000000000000000000010000000000000001
  //          ^       ^       ^       ^       ^       ^       ^       ^
  constexpr auto expand = [] (std::uint8_t x) -> std::uint64_t {
    return ((x * mult) & mask) >> 7;
  };

  auto result{ std::transform_reduce( src.begin(), src.end()
                                    , std::uint64_t{}
                                    , std::plus<>{}, expand ) };


  // the resulting byte can be formed a bit at a time "traversing"
  // the bytes of the previous sum.
  std::uint64_t target{ src.size() / 2 };
  std::uint8_t mode{};
  for ( int i{8}; i--; )
  {
      mode |= ((result & 0xff) > target) << i;
      result >>= 8;
  }

  return mode;
}

Live here: https://godbolt.org/z/K1MrYcnvK

Related if you want to take a non-portable (Intel intrinsics), more performant bit-manipulation route: https://stackoverflow.com/a/75863985/11107541 — starball, Mar 28 '23 at 17:08
@user Ah, right, the first snippet in https://stackoverflow.com/a/51750902/4944425 is basically my approach. They avoid the double multiplication (and the double magic numbers) by "storing" the bits backwards, I just missed that, my bad. I'm going to scrap this. — Bob__, Mar 28 '23 at 18:46

bitmask · Answer 6 · 2023-03-28T10:01:06.240

There's a sort-of bit-fiddly approach that would work the same with std::uint8_t and std::uint64_t. Is (almost) branch-free and in-situ. However, there is a caveat that I discuss below.

Idea

If you have a series of bits and you were to sort them, you could look at the mid value and get your average. In fact you can get any quantile you wish. This works because bits are binary.
Example: 01001010001 -> 00000001111 -> mid bit is 0 -> average is 0.

Sorting

If you have two numbers a and b you can sort the bits column-wise by realising that the result of moving all ones "down" and all zeroes "up" is:

// pseudo-swap:
a2 = a&b;
b2 = a|b;

This is guaranteed to preserve the number of ones and zeroes in total and it works in bit-parallel. And is branch free.

Problem

Applying this to n values without branching is not as straightforward as one might think. The trivial solution is to pseudo-swap each value with each other which has quadratic complexity on paper, but is also branch-free (except for the loops which can be branch-predicted or unrolled easily) and requires no additional memory besides the existing array. I am quite certain that there is a better way of doing this, but it's not coming to me. Maybe someone can develop this idea further.

Algorithm

The algorithm turns this:

into this:

00000000
00000000
11100000
11100000 <--- your average
11110001
11111111
11111111

Example implementation

void pseudoswap(std::uint8_t& a, std::uint8_t& b) {
  auto const fst = a & b;
  auto const snd = a | b;
  a = fst;
  b = snd;
}

template <std::size_t N>
void sieve(std::uint8_t (&xs)[N]) {
  for (auto i = 0; i < N; ++i) {
    for (auto j = i+1; j < N; ++j) {
      pseudoswap(xs[i],xs[j]);
    }
  }
}

Have a look at a working demo on compiler explorer which shows some nice optimisation of the main implementation:

.L88:
        mov     rsi, r8
        add     r8, 1
        mov     rax, rsi
.L90:
        movzx   edx, BYTE PTR [rsi-1]
        movzx   ecx, BYTE PTR [rax]
        add     rax, 1
        mov     edi, edx
        or      edx, ecx
        and     edi, ecx
        mov     BYTE PTR [rsi-1], dil
        mov     BYTE PTR [rax-1], dl
        cmp     rbx, rax
        jne     .L90
        cmp     rbx, r8
        jne     .L88

For small array sizes the sieve loops appear to be unrolled without any jumps in the hot path.

Discussion

It sounds bad that this is quadratic. But you say that your input array is very small, between 10 and 20. This means that a complexity of b*n and n*n (with b = number of bits in one word) is indistinguishable because b ~ n. The advantage of doing it this way, is:

no data-dependent branches
no additional memory (no allocations)
no integer division in hot path
can have any quantile (not just average)

If you get to the point where you have millions or billions of bytes, switch to a solution that counts 1s because then the quadratic complexity will become an issue.

If in doubt benchmark!

Hm. Are you sure that this will answer the question of the OP? I checked your code with `std::uint8_t xs[]{1,2,3,128,128};` and could not interprete youre print out. But maybe I am wrong . . . — A M, Mar 28 '23 at 11:17
@Peter Have a look at the compiler explorer link in the answer. It executes the code and runs it. Be aware that average does not mean integer average but bit-wise average. — bitmask, Mar 28 '23 at 12:07

njuffa · Answer 7 · 2023-03-29T21:49:36.097

This is a partial answer as my approach cannot handle an arbitrary number of inputs bytes. It can handle any number of inputs for which an efficient sorting network is known. I am showing examples for arrays with a length of two through sixteen bytes in the code below.

Knuth, TAOCP, vol. 4A, section 7.1.1 explains in much detail the utility of the ternary majority operation ⟨xyz⟩ = (x ∧ y) ∨ (y ∧ z) ∨ (x ∧ z). This function, when applied bitwise to three inputs, will implement the result requested by asker. Knuth prefers to call the function the "median" instead of the "majority" because if one lets ∧ correspond to min and ∨ correspond to max, then ⟨xyz⟩ = y precisely when x ≤ y ≤ z.

The interesting observation here is that one way of constructing sorting networks is from min and max primitives. The median of three median3 (x,y,z) = min (max (min (y, z), x), max (y,z)), while min3 (x,y,z) = min (x, min (y, z)) and max3 (x,y,z) = max (x, max (y, z)).

Knuth points out that any monotone Boolean function can be expressed using only the median operation and the constants 0 and 1. Therefore, the median-of-five can be expressed this way, and according to Knuth (p. 64) the most efficient arrangement is ⟨vwxyz⟩ = ⟨v⟨xyz⟩⟨wx⟨wyz⟩⟩⟩.

As a test case for the derive-ability of the bitwise median computation from sorting networks, I used a nine-input network from the literature and converted it into the bit-wise median-of-nine computation, which delivers the result requested by asker. I also translated some search network graphs I had lying around from previous work into the corresponding sequences of mix / max operations, and translated additional network graphs from TAOCP vol. 3. For additional sorting networks I consulted Bert Dobbelaere's list as noted in code comments. The Wikipedia article on sorting networks suggests that (near) optimal sorting networks are known up to 20 inputs, thus covering the range of array lengths of interest to the asker.

As for efficiency, byte_mode_16() in the code below compiled with Clang 16 compiles to about 170 x86 instructions with a lot of instruction-level parallelism, so I would guesstimate if could execute in about 50 cycles on a modern x86-64 CPU. On NVIDIA GPUs, which support arbitrary three-input logic operations with the LOP3 instruction, the same function compiles to about 80 instructions.

#include <cstdio>
#include <cstdlib>
#include <climits>
#include <cstdint>

#define REPS      (1000000)
#define N         (16)
#define USE_KNUTH (0)

uint8_t byte_mode_ref (const uint8_t *bytes, int n)
{
    int count1 [CHAR_BIT] = {0, 0, 0, 0, 0, 0, 0, 0};
    uint8_t res = 0;

    for (int i = 0; i < n; i++) {
        uint8_t a = bytes [i];
        for (int j = 0; j < CHAR_BIT; j++) {
            uint8_t bit = (a >> j) & 1;
            count1 [j] += bit;
        }
    }
    for (int j = 0; j < CHAR_BIT; j++) {
        res |= (count1[j] > (n / 2)) << j;
    }
    return res;
}

#define MIN3(x,y,z)     (x & y & z)
#define MEDIAN3(x,y,z)  (((y & z) | x) & (y | z))
#define MAX3(x,y,z)     (x | y | z)

#define MIN2(x,y) (x & y)
#define MAX2(x,y) (x | y)
#define SWAP(i,j) do { int s = MIN2(a##i,a##j); int t = MAX2(a##i,a##j); a##i = s; a##j = t; } while (0)

uint8_t byte_mode_2 (const uint8_t *bytes)
{
    int x = bytes [0];
    int y = bytes [1];

    return MIN2 (x, y);
}

uint8_t byte_mode_3 (const uint8_t *bytes)
{
    int x = bytes [0];
    int y = bytes [1];
    int z = bytes [2];

    return MEDIAN3 (x, y, z);
}

uint8_t byte_mode_4 (const uint8_t *bytes)
{
    int a0 = bytes [0];
    int a1 = bytes [1];
    int a2 = bytes [2];
    int a3 = bytes [3];

    SWAP (0, 2); SWAP (1, 3);
    SWAP (0, 1); SWAP (2, 3);
    SWAP (1, 2);

    return a1;
}

uint8_t byte_mode_5 (const uint8_t *bytes)
{
#if USE_KNUTH
    int v = bytes [0];
    int w = bytes [1];
    int x = bytes [2];
    int y = bytes [3];
    int z = bytes [4];

    /* Knuth, TAOCP, Vol. 4a, p. 64 */
    return MEDIAN3 (v, MEDIAN3 (x, y, z), MEDIAN3 (w, x, (MEDIAN3 (w, y, z))));
#else // USE_KNUTH
    int a0 = bytes [0];
    int a1 = bytes [1];
    int a2 = bytes [2];
    int a3 = bytes [3];
    int a4 = bytes [4];

    SWAP (0, 3); SWAP (1, 4);
    SWAP (0, 2); SWAP (1, 3);
    SWAP (0, 1); SWAP (2, 4);
    SWAP (1, 2); SWAP (3, 4);
    SWAP (2, 3);
    
    return a2;
#endif // USE_KNUTH
}

uint8_t byte_mode_6 (const uint8_t *bytes)
{
    int a0 = bytes [0];
    int a1 = bytes [1];
    int a2 = bytes [2];
    int a3 = bytes [3];
    int a4 = bytes [4];
    int a5 = bytes [5];
  
    SWAP (0, 5); SWAP (1, 3); SWAP (2,4);
    SWAP (1, 2); SWAP (3, 4); 
    SWAP (0, 3); SWAP (2, 5);
    SWAP (0, 1); SWAP (2, 3); SWAP (4, 5);
    SWAP (1, 2); SWAP (3, 4);

    return a2;
}

uint8_t byte_mode_7 (const uint8_t *bytes)
{
    int a0 = bytes [0];
    int a1 = bytes [1];
    int a2 = bytes [2];
    int a3 = bytes [3];
    int a4 = bytes [4];
    int a5 = bytes [5];
    int a6 = bytes [6];
  
    SWAP (0, 6); SWAP (2, 3); SWAP (4, 5);
    SWAP (0, 2); SWAP (1, 4); SWAP (3, 6);
    SWAP (0, 1); SWAP (2, 5); SWAP (3, 4);
    SWAP (1, 2); SWAP (4, 6);
    SWAP (2, 3); SWAP (4, 5);
    SWAP (1, 2); SWAP (3, 4); SWAP (5, 6);

    return a3;
}

uint8_t byte_mode_8 (const uint8_t *bytes)
{
    int a0 = bytes [0];
    int a1 = bytes [1];
    int a2 = bytes [2];
    int a3 = bytes [3];
    int a4 = bytes [4];
    int a5 = bytes [5];
    int a6 = bytes [6];
    int a7 = bytes [7];

    SWAP (0, 2); SWAP (1, 3); SWAP (4, 6); SWAP (5, 7);
    SWAP (0, 4); SWAP (1, 5); SWAP (2, 6); SWAP (3, 7);
    SWAP (0, 1); SWAP (2, 3); SWAP (4, 5); SWAP (6, 7);
    SWAP (2, 4); SWAP (3, 5);
    SWAP (1, 4); SWAP (3, 6);
    SWAP (1, 2); SWAP (3, 4); SWAP (5, 6);
    
    return a3;
}

uint8_t byte_mode_9 (const uint8_t *bytes)
{
    int a0 = bytes [0];
    int a1 = bytes [1];
    int a2 = bytes [2];
    int a3 = bytes [3];
    int a4 = bytes [4];
    int a5 = bytes [5];
    int a6 = bytes [6];
    int a7 = bytes [7];
    int a8 = bytes [8];
    int b0, b1, b2, b3, b4, b5, b6, b7, b8;
    int c2, c4, c6;

    /*
      Chaitali Chakrabarti and Li-Yu Wang, "Novel sorting network-based 
      architectures for rank order filters." IEEE Transactions on Very
      Large Scale Integration Systems, Vol. 2, No. 4, 1994, pp. 502-507
    */
    // SORT3 (0, 1, 2)
    b0 = MIN3    (a0, a1, a2);
    b1 = MEDIAN3 (a0, a1, a2);
    b2 = MAX3    (a0, a1, a2);
    // SORT3 (3, 4, 5)
    b3 = MIN3    (a3, a4, a5);
    b4 = MEDIAN3 (a3, a4, a5);
    b5 = MAX3    (a3, a4, a5);
    // SORT3 (6, 7, 8)
    b6 = MIN3    (a6, a7, a8);
    b7 = MEDIAN3 (a6, a7, a8);
    b8 = MAX3    (a6, a7, a8);
    // SORT3 (0, 3, 6)
    // c0 = MIN3    (b0, b3, b6);
    // c3 = MEDIAN3 (b0, b3, b6);
    c6 = MAX3    (b0, b3, b6);
    // SORT3 (1, 4, 7)
    // c1 = MIN3    (b1, b4, b7);
    c4 = MEDIAN3 (b1, b4, b7);
    // c7 = MAX3    (b1, b4, b7);
    // SORT3 (2, 5, 8)
    c2 = MIN3    (b2, b5, b8);
    // c5 = MEDIAN3 (b2, b5, b8);
    // c8 = MAX3    (b2, b5, b8);
    // final median computation
    // SORT3 (2, 4, 6)
    return MEDIAN3 (c2, c4, c6);
}

uint8_t byte_mode_10 (const uint8_t *bytes)
{
    int a0  = bytes [0];
    int a1  = bytes [1];
    int a2  = bytes [2];
    int a3  = bytes [3];
    int a4  = bytes [4];
    int a5  = bytes [5];
    int a6  = bytes [6];
    int a7  = bytes [7];
    int a8  = bytes [8];
    int a9  = bytes [9];

#if USE_KNUTH
    /* Knuth, TAOCP, vol. 3, p. 227 */
    SWAP (4, 9); SWAP (3, 8); SWAP (2, 7); SWAP (1, 6); SWAP (0, 5);
    SWAP (1, 4); SWAP (6, 9); SWAP (0, 3); SWAP (5, 8);
    SWAP (0, 2); SWAP (3, 6); SWAP (7, 9);
    SWAP (0, 1); SWAP (2, 4); SWAP (5, 7); SWAP (8, 9);
    SWAP (4, 6); SWAP (1, 2); SWAP (7, 8); SWAP (3, 5);
    SWAP (2, 5); SWAP (6, 8); SWAP (1, 3); SWAP (4, 7);
    SWAP (2, 3); SWAP (6, 7);
    SWAP (3, 4); SWAP (5, 6);
    SWAP (4, 5);
#else // USE_KNUTH
    /* https://bertdobbelaere.github.io/sorting_networks.html */
    SWAP (0, 8); SWAP (1, 9); SWAP (2, 7); SWAP (3, 5); SWAP (4, 6);
    SWAP (0, 2); SWAP (1, 4); SWAP (5, 8); SWAP (7, 9);
    SWAP (0, 3); SWAP (2, 4); SWAP (5, 7); SWAP (6, 9);
    SWAP (0, 1); SWAP (3, 6); SWAP (8, 9);
    SWAP (1, 5); SWAP (2, 3); SWAP (4, 8); SWAP (6, 7);
    SWAP (1, 2); SWAP (3, 5); SWAP (4, 6); SWAP (7, 8);
    SWAP (2, 3); SWAP (4, 5); SWAP (6, 7);
    SWAP (3, 4); SWAP (5, 6);
#endif // USE_KNUTH

    return a4;
}

uint8_t byte_mode_11 (const uint8_t *bytes)
{
    int a0  = bytes [0];
    int a1  = bytes [1];
    int a2  = bytes [2];
    int a3  = bytes [3];
    int a4  = bytes [4];
    int a5  = bytes [5];
    int a6  = bytes [6];
    int a7  = bytes [7];
    int a8  = bytes [8];
    int a9  = bytes [9];
    int a10 = bytes [10];

    /* https://bertdobbelaere.github.io/sorting_networks.html */
    SWAP (0, 9); SWAP (1, 6); SWAP (2,  4); SWAP (3,  7); SWAP (5,  8);
    SWAP (0, 1); SWAP (3, 5); SWAP (4, 10); SWAP (6,  9); SWAP (7,  8);
    SWAP (1, 3); SWAP (2, 5); SWAP (4,  7); SWAP (8, 10);
    SWAP (0, 4); SWAP (1, 2); SWAP (3,  7); SWAP (5,  9); SWAP (6,  8);
    SWAP (0, 1); SWAP (2, 6); SWAP (4,  5); SWAP (7,  8); SWAP (9, 10);
    SWAP (2, 4); SWAP (3, 6); SWAP (5,  7); SWAP (8,  9);
    SWAP (1, 2); SWAP (3, 4); SWAP (5,  6); SWAP (7,  8);
    SWAP (2, 3); SWAP (4, 5); SWAP (6,  7);

    return a5;
}

uint8_t byte_mode_12 (const uint8_t *bytes)
{
    int a0  = bytes [0];
    int a1  = bytes [1];
    int a2  = bytes [2];
    int a3  = bytes [3];
    int a4  = bytes [4];
    int a5  = bytes [5];
    int a6  = bytes [6];
    int a7  = bytes [7];
    int a8  = bytes [8];
    int a9  = bytes [9];
    int a10 = bytes [10];
    int a11 = bytes [11];

#if USE_KNUTH
    /* Knuth, TAOCP, vol. 3, p. 227 */
    SWAP (0, 1); SWAP (2,  3); SWAP (4,  5); SWAP (6,  7); SWAP (8,  9); SWAP (10, 11);
    SWAP (1, 3); SWAP (5,  7); SWAP (9, 11); SWAP (0,  2); SWAP (4,  6); SWAP ( 8, 10);
    SWAP (1, 2); SWAP (5,  6); SWAP (9, 10);
    SWAP (1, 5); SWAP (6, 10); 
    SWAP (2, 6); SWAP (5,  9);  
    SWAP (1, 5); SWAP (6, 10);
    SWAP (0, 4); SWAP (7, 11);
    SWAP (3, 7); SWAP (4,  8);
    SWAP (0, 4); SWAP (7, 11);
    SWAP (1, 4); SWAP (7, 10); SWAP (3,  8);
    SWAP (2, 3); SWAP (8,  9);
    SWAP (2, 4); SWAP (7,  9); SWAP (3,  5); SWAP (6,  8);
    SWAP (3, 4); SWAP (5,  6); SWAP (7,  8);
#else // USE_KNUTH
    /* https://bertdobbelaere.github.io/sorting_networks.html */
    SWAP (0, 8); SWAP (1,  7); SWAP (2,  6); SWAP (3, 11); SWAP (4, 10); SWAP (5,   9);
    SWAP (0, 1); SWAP (2,  5); SWAP (3,  4); SWAP (6,  9); SWAP (7,  8); SWAP (10, 11);
    SWAP (0, 2); SWAP (1,  6); SWAP (5, 10); SWAP (9, 11);
    SWAP (0, 3); SWAP (1,  2); SWAP (4,  6); SWAP (5,  7); SWAP (8, 11); SWAP (9,  10);
    SWAP (1, 4); SWAP (3,  5); SWAP (6,  8); SWAP (7, 10);
    SWAP (1, 3); SWAP (2,  5); SWAP (6,  9); SWAP (8, 10);
    SWAP (2, 3); SWAP (4,  5); SWAP (6,  7); SWAP (8,  9);
    SWAP (4, 6); SWAP (5,  7);
    SWAP (3, 4); SWAP (5,  6); SWAP (7,  8);
#endif // USE_KNUTH

    return a5;
}

uint8_t byte_mode_13 (const uint8_t *bytes)
{
    int a0  = bytes [0];
    int a1  = bytes [1];
    int a2  = bytes [2];
    int a3  = bytes [3];
    int a4  = bytes [4];
    int a5  = bytes [5];
    int a6  = bytes [6];
    int a7  = bytes [7];
    int a8  = bytes [8];
    int a9  = bytes [9];
    int a10 = bytes [10];
    int a11 = bytes [11];
    int a12 = bytes [12];

    /* Knuth, TAOCP, vol. 3, p. 227 */
    SWAP (0,  5); SWAP ( 8, 12); SWAP (1, 7); SWAP ( 3,  9); SWAP ( 2,  4); SWAP (6, 11);
    SWAP (0,  6); SWAP ( 7,  9); SWAP (1, 3); SWAP ( 5, 11); SWAP ( 2,  8); SWAP (4, 12);
    SWAP (0,  2); SWAP ( 4,  5); SWAP (6, 8); SWAP ( 9, 10); SWAP (11, 12); 
    SWAP (7,  8); SWAP (10, 12); SWAP (5, 9); SWAP ( 3, 11);
    SWAP (1,  5); SWAP ( 9, 11); SWAP (2, 3); SWAP ( 4,  7); SWAP ( 8, 10);
    SWAP (0,  1); SWAP ( 5,  6); SWAP (8, 9); SWAP (10, 11);
    SWAP (3,  6); SWAP ( 2,  5); SWAP (1, 4); SWAP ( 7,  8); SWAP ( 9, 10);
    SWAP (3,  7); SWAP ( 1,  2); SWAP (4, 5); SWAP ( 6,  8);
    SWAP (2,  4); SWAP ( 6,  7); SWAP (3, 5); SWAP ( 8,  9);
    SWAP (3,  4); SWAP ( 5,  6);

    return a6;
}

uint8_t byte_mode_14 (const uint8_t *bytes)
{
    int a0  = bytes [0];
    int a1  = bytes [1];
    int a2  = bytes [2];
    int a3  = bytes [3];
    int a4  = bytes [4];
    int a5  = bytes [5];
    int a6  = bytes [6];
    int a7  = bytes [7];
    int a8  = bytes [8];
    int a9  = bytes [9];
    int a10 = bytes [10];
    int a11 = bytes [11];
    int a12 = bytes [12];
    int a13 = bytes [13];

    /* https://bertdobbelaere.github.io/sorting_networks.html */
    SWAP (0,  1); SWAP (2, 3); SWAP (4,  5); SWAP (6,  7); SWAP ( 8,  9); SWAP (10, 11);  SWAP (12, 13);
    SWAP (0,  2); SWAP (1, 3); SWAP (4,  8); SWAP (5,  9); SWAP (10, 12); SWAP (11, 13);
    SWAP (0, 10); SWAP (1, 6); SWAP (2, 11); SWAP (3, 13); SWAP ( 5,  8); SWAP ( 7, 12);
    SWAP (1,  4); SWAP (2, 8); SWAP (3,  6); SWAP (5, 11); SWAP ( 7, 10); SWAP ( 9, 12);
    SWAP (0,  1); SWAP (3, 9); SWAP (4, 10); SWAP (5,  7); SWAP ( 6,  8); SWAP (12, 13);
    SWAP (1,  5); SWAP (2, 4); SWAP (3,  7); SWAP (6, 10); SWAP ( 8, 12); SWAP ( 9, 11);
    SWAP (1,  2); SWAP (3, 5); SWAP (4,  6); SWAP (7,  9); SWAP ( 8, 10); SWAP (11, 12);
    SWAP (2,  3); SWAP (4, 5); SWAP (6,  7); SWAP (8,  9); SWAP (10, 11);
    SWAP (3,  4); SWAP (5, 6); SWAP (7,  8); SWAP (9, 10);

    return a6;
}

uint8_t byte_mode_15 (const uint8_t *bytes)
{
    int a0  = bytes [0];
    int a1  = bytes [1];
    int a2  = bytes [2];
    int a3  = bytes [3];
    int a4  = bytes [4];
    int a5  = bytes [5];
    int a6  = bytes [6];
    int a7  = bytes [7];
    int a8  = bytes [8];
    int a9  = bytes [9];
    int a10 = bytes [10];
    int a11 = bytes [11];
    int a12 = bytes [12];
    int a13 = bytes [13];
    int a14 = bytes [14];

    /* https://bertdobbelaere.github.io/sorting_networks.html */
    SWAP (0,  6); SWAP (1, 10); SWAP (2,14); SWAP (3,  9); SWAP ( 4, 12); SWAP ( 5, 13); SWAP ( 7, 11);
    SWAP (0,  7); SWAP (2,  5); SWAP (3, 4); SWAP (6, 11); SWAP ( 8, 10); SWAP ( 9, 12); SWAP (13, 14);
    SWAP (1, 13); SWAP (2,  3); SWAP (4, 6); SWAP (5,  9); SWAP ( 7,  8); SWAP (10, 14); SWAP (11, 12);
    SWAP (0,  3); SWAP (1,  4); SWAP (5, 7); SWAP (6, 13); SWAP ( 8,  9); SWAP (10, 11); SWAP (12, 14);
    SWAP (0,  2); SWAP (1,  5); SWAP (3, 8); SWAP (4,  6); SWAP ( 7, 10); SWAP ( 9, 11); SWAP (12, 13);
    SWAP (0,  1); SWAP (2,  5); SWAP (3,10); SWAP (4,  8); SWAP ( 6,  7); SWAP ( 9, 12); SWAP (11, 13);
    SWAP (1,  2); SWAP (3,  4); SWAP (5, 6); SWAP (7,  9); SWAP ( 8, 10); SWAP (11, 12);
    SWAP (3,  5); SWAP (4,  6); SWAP (7, 8); SWAP (9, 10);
    SWAP (2,  3); SWAP (4,  5); SWAP (6, 7); SWAP (8,  9); SWAP (10, 11);

    return a7;
}

uint8_t byte_mode_16 (const uint8_t *bytes)
{
    int a0  = bytes [0];
    int a1  = bytes [1];
    int a2  = bytes [2];
    int a3  = bytes [3];
    int a4  = bytes [4];
    int a5  = bytes [5];
    int a6  = bytes [6];
    int a7  = bytes [7];
    int a8  = bytes [8];
    int a9  = bytes [9];
    int a10 = bytes [10];
    int a11 = bytes [11];
    int a12 = bytes [12];
    int a13 = bytes [13];
    int a14 = bytes [14];
    int a15 = bytes [15];

    /* Knuth TAOCP, vol. 3, p. 229 */
    SWAP (0,  1); SWAP ( 2,  3); SWAP ( 4,  5); SWAP ( 6,  7); SWAP ( 8,  9); SWAP (10, 11); SWAP (12, 13); SWAP (14, 15);
    SWAP (1,  3); SWAP ( 5,  7); SWAP ( 9, 11); SWAP (13, 15); SWAP ( 0,  2); SWAP ( 4,  6); SWAP ( 8, 10); SWAP (12, 14);
    SWAP (3,  7); SWAP (11, 15); SWAP ( 2,  6); SWAP (10, 14); SWAP ( 1,  5); SWAP ( 9, 13); SWAP ( 0,  4); SWAP ( 8, 12);
    SWAP (7, 15); SWAP ( 6, 14); SWAP ( 5, 13); SWAP ( 4, 12); SWAP ( 3, 11); SWAP ( 2, 10); SWAP ( 1,  9); SWAP ( 0,  8);
    SWAP (1,  2); SWAP ( 3, 12); SWAP (13, 14); SWAP ( 7, 11); SWAP ( 4,  8); SWAP ( 5, 10); SWAP ( 6,  9);
    SWAP (1,  4); SWAP (11, 14); SWAP ( 7, 13); SWAP ( 2,  8); SWAP ( 6, 12); SWAP ( 3, 10); SWAP ( 5,  9);
    SWAP (3,  5); SWAP ( 7,  9); SWAP (11, 13); SWAP ( 2,  4); SWAP ( 6,  8); SWAP (10, 12); 
    SWAP (5,  8); SWAP ( 9, 12); SWAP ( 3,  6); SWAP ( 7, 10); 
    SWAP (3,  4); SWAP ( 5,  6); SWAP ( 7,  8); SWAP ( 9, 10); SWAP (11, 12);

    return a7;
}

// George Marsaglia's KISS PRNG, period 2**123. Newsgroup sci.math, 21 Jan 1999
// Bug fix: Greg Rose, "KISS: A Bit Too Simple" http://eprint.iacr.org/2011/007
static uint32_t kiss_z=362436069, kiss_w=521288629;
static uint32_t kiss_jsr=123456789, kiss_jcong=380116160;
#define znew (kiss_z=36969*(kiss_z&65535)+(kiss_z>>16))
#define wnew (kiss_w=18000*(kiss_w&65535)+(kiss_w>>16))
#define MWC  ((znew<<16)+wnew )
#define SHR3 (kiss_jsr^=(kiss_jsr<<13),kiss_jsr^=(kiss_jsr>>17), \
              kiss_jsr^=(kiss_jsr<<5))
#define CONG (kiss_jcong=69069*kiss_jcong+1234567)
#define KISS ((MWC^CONG)+SHR3)

int main (void)
{    
    uint8_t res, ref, arr [N];
    printf ("Testing byte arrays of length %d\n", N);
    for (int i = 0; i < REPS; i++) {
        for (int j = 0; j < N; j++) {
            arr [j] = KISS;
        }
        ref = byte_mode_ref (arr, N);
#if N==2
        res = byte_mode_2 (arr);
#elif N==3
        res = byte_mode_3 (arr);
#elif N==4
        res = byte_mode_4 (arr);
#elif N==5
        res = byte_mode_5 (arr);
#elif N==6
        res = byte_mode_6 (arr);
#elif N==7
        res = byte_mode_7 (arr);
#elif N==8
        res = byte_mode_8 (arr);
#elif N==9
        res = byte_mode_9 (arr);
#elif N==10
        res = byte_mode_10 (arr);
#elif N==11
        res = byte_mode_11 (arr);
#elif N==12
        res = byte_mode_12 (arr);
#elif N==13
        res = byte_mode_13 (arr);
#elif N==14
        res = byte_mode_14 (arr);
#elif N==15
        res = byte_mode_15 (arr);
#elif N==16
        res = byte_mode_16 (arr);
#else
#error unsupported value of N
#endif
        if (res != ref) {
            printf ("Error: res=%02x ref=%02x bytes: ", res, ref);
            for (int j = 0; j < N; j++) {
                printf ("%02x ", arr[j]);
            }
            printf ("\nTest FAILED\n");
            return EXIT_FAILURE;
        }
    }
    printf ("Test PASSED\n");
    return EXIT_SUCCESS;
}