My goal is as the following,
Generate successive values, such that each new one was never generated before, until all possible values are generated. At this point, the counter start the same sequence again. The main point here is that, all possible values are generated without repetition (until the period is exhausted). It does not matter if the sequence is simple 0, 1, 2, 3,..., or in other order.
For example, if the range can be represented simply by an unsigned
, then
void increment (unsigned &n) {++n;}
is enough. However, the integer range is larger than 64-bits. For example, in one place, I need to generated 256-bits sequence. A simple implementation is like the following, just to illustrate what I am trying to do,
typedef std::array<uint64_t, 4> ctr_type;
static constexpr uint64_t max = ~((uint64_t) 0);
void increment (ctr_type &ctr)
{
if (ctr[0] < max) {++ctr[0]; return;}
if (ctr[1] < max) {++ctr[1]; return;}
if (ctr[2] < max) {++ctr[2]; return;}
if (ctr[3] < max) {++ctr[3]; return;}
ctr[0] = ctr[1] = ctr[2] = ctr[3] = 0;
}
So if ctr
start with all zeros, then first ctr[0]
is increased one by one until it reach max
, and then ctr[1]
, and so on. If all 256-bits are set, then we reset it to all zero, and start again.
The problem is that, such implementation is surprisingly slow. My current improved version is sort of equivalent to the following,
void increment (ctr_type &ctr)
{
std::size_t k = (!(~ctr[0])) + (!(~ctr[1])) + (!(~ctr[2])) + (!(~ctr[3]))
if (k < 4)
++ctr[k];
else
memset(ctr.data(), 0, 32);
}
If the counter is only manipulated with the above increment
function, and always start with zero, then ctr[k] == 0
if ctr[k - 1] == 0
. And thus the value k
will be the index of the first element that is less than the maximum.
I expected the first to be faster, since branch mis-prediction shall happen only once in every 2^64 iterations. The second, though mis-predication only happen every 2^256 iterations, it shall not make a difference. And apart from the branching, it needs four bitwise negation, four boolean negation, and three addition. Which might cost much more than the first.
However, both clang
, gcc
, or intel icpc
generate binaries that the second was much faster.
My main question is that does anyone know if there any faster way to implement such a counter? It does not matter if the counter start by increasing the first integers or if it is implemented as an array of integers at all, as long as the algorithm generate all 2^256 combinations of 256-bits.
What makes things more complicated, I also need non uniform increment. For example, each time the counter is incremented by K
where K > 1
, but almost always remain a constant. My current implementation is similar to the above.
To provide some more context, one place I am using the counters is using them as input to AES-NI aesenc instructions. So distinct 128-bits integer (loaded into __m128i
), after going through 10 (or 12 or 14, depending on the key size) rounds of the instructions, a distinct 128-bits
integer is generated. If I generate one __m128i
integer at once, then the cost of increment
matters little. However, since aesenc has quite a bit latency, I generate integers by blocks. For example, I might have 4 blocks, ctr_type block[4]
, initialized equivalent to the following,
block[0]; // initialized to zero
block[1] = block[0]; increment(block[1]);
block[2] = block[1]; increment(block[2]);
block[3] = block[2]; increment(block[3]);
And each time I need new output, I increment
each block[i]
by 4, and generate 4 __m128i
output at once. By interleaving instructions, overall I was able to increase the throughput, and reduce the cycles per bytes of output (cpB) from 6 to 0.9 when using 2 64-bits integers as the counter and 8 blocks. However, if instead, use 4 32-bits integers as counter, the throughput, measured as bytes per sec is reduced to half. I know for a fact that on x86-64, 64-bits integers could be faster than 32-bits in some situations. But I did not expect such simple increment operation makes such a big difference. I have carefully benchmarked the application, and the increment
is indeed the one slow down the program. Since the loading into __m128i
and store the __m128i
output into usable 32-bits or 64-bits integers are done through aligned pointers, the only difference between the 32-bits and 64-bits version is how the counter is incremented. I expected that the AES-NI expected, after loading the integers into __m128i
, shall dominate the performance. But when using 4 or 8 blocks, it was clearly not the case.
So to summary, my main question is that, if anyone know a way to improve the above counter implementation.