This is pretty complicated, but stay with me. I can't cite any specific place I got this from except "27 years of coding experience."
The original problem has the number line set to natural integers 0,1,2,3,4,5,6... However, we only care about numbers that are divisible by 3, so let's redefine our number line to have only three values in it: {2,3,4} and remap the number line such that:
0 => 4
1 => 2
2 => 3
3 => 4
4 => 2
5 => 3
6 => 4
.. and so on.
You'll notice that numbers that are divisible by 3 are mapped to 4 in our sequence. Why use {2,3,4}? 4 is 100 in binary, which means any element with its 3rd bit set will be divisible by 3. This is easy to test with bit operations.
Since we're using 2,3,4 as the trinary sequence, we can reduce our array element size to 4-bits. We'll define the array as 8-bit values, but half the size as we need in bytes (plus 1 in case it is an odd-sized array), and store the elements two per byte in the array. The adds and compares can be done as SIMD (single instruction, multiple data) operations, incrementing or checking up to 16 elements per loop iteration by using some clever bit operations.
That's the concept. Now down to the code.
First, we need to allocate and initialize our array.
unsigned char *arr = malloc(n/2 + 1);
// Init all element values to 4:
memset(&arr, 0x44, n/2 + 1);
We're going to increment 16 elements at a time by casting a block of 8 array bytes to a uint_64, add 0x1111111111111111
and then skip to the next block. Repeat with 32-bit, 16-bit, 8-bit and 4-bit math for up to 8, 4, 2 or 1 remaining at the end of the operation.
Before each increment, anything that has a value of 4 needs to be decremented by 3 before the increment to keep the numbers in the proper position.
Here's the code (untested) for the increment command:
/**
@param p
p is the address of the byte with the first aligned element to be incremented, &arr[A/2] when A is even, &arr[A/2]+1 when A is odd.
@param j
j is the number of elements to increment. (B-A) when A is even, (B-A-1) when A is odd.
*/
void increment_aligned_block(unsigned char *p, int j)
uint64_t fours;
while (j>16) {
// Find the ones that are value 4
fours = *p & 0x4444444444444444;
// Decrement each that matches by 3
*p -= (fours >> 1 | fours >> 2);
// Add 1 to each of the 16 array elements in the block.
(uint64_t)(*p) += 0x1111111111111111;
p += 8; j -= 16;
}
if (j >= 8) {
// repeat the above for 32-bits (8 elements)
// left as an exercise for the reader.
p += 4; j -= 8;
}
if (j >= 4) {
// repeat the above for 16-bits (4 elements)
// left as an exercise for the reader.
p += 2; j -= 4;
}
if (j >= 2) {
// repeat the above for 8-bits (2 elements)
// left as an exercise for the reader.
p += 1; j -= 2;
}
if (j == 1) {
// repeat the above for 8-bits (1 elements)
// left as an exercise for the reader.
}
}
For the compare use:
/**
@param p
p is the address of the byte with the first aligned element to be counted, &arr[A/2] when A is even, &arr[A/2]+1 when A is odd.
@param j
j is the number of elements to count. (B-A) when A is even, (B-A-1) when A is odd.
*/
int count_aligned_block(unsigned char *p, int j)
int count = 0;
uint64_t divisible_map;
while (j > 16) {
// Find the values of 4 in the block
divisible_map = (uint64_t)(*p) & 0x4444444444444444;
// Count the number of 4s in the block,
// 8-bits at a time
while (divisible_map) {
switch (divisible_map & 0x44) {
case 0x04:
case 0x40:
count++;
break;
case 0x44:
count += 2;
break;
default:
break;
}
divisible_map >>= 8;
}
}
// Repeat as above with 32, 16, 8 and 4-bit math.
// Left as an exercise to the reader
return count;
}
You might have noticed the functions are called foo_aligned_block
and p
needs to be the byte of the first aligned element. What is that? Since we're packing two elements per byte, the starting element index must be aligned to an even number. If the command in the file is 0 0 30
, then we can call increment_algined_block(&arr[A/2], 30)
, no problem. However, if the command in the file is 0 1 30
, then we need to have additional code to handle the unaligned first element at index 1, and then call increment_aligned_block(&arr[A/2 + 1], 29)
. Again, left as an exercise to the reader.
I'd like to note this is not the most optimal it can be.
Unaligned accesses are usually pretty expensive. That is, reading an 8-byte value from an 8-byte aligned address is faster than from a non-aligned address. We can add additional optimizations to only call foo_aligned_block()
such that all accesses are guaranteed to be aligned.