I have been pounding my head against a wall trying to solve this issue for about a month now and neither my C skills nor my google-fu has been strong enough to come up with a solution.
One of my favorite side projects has been and continues to be attempting to find a palindrome for the number 196 via the reverse and add method:
196 + 691 = 887
887 + 788 = 1675
And on and on until the result reads the same front to back.
Lately, my approach of choice has been to use cuda but I've run into the same snag over and over. The carry.
Backing up, I represent the number in memory as an array of unsigned char, each digit is one char - so essentially unpacked bcd.
__device__ __align__(4) unsigned char DigitArray[1024 * 1024];
The partial sum generation is easy in parallel. I store the current digit count in device memory, then on each iteration:
__device__ int DigitCount;
__global__ void PartialSums()
{
int idx = GlobalThreadIndex();
int rev = DigitCount - (1 + idx);
unsigned char sum = DigitArray[idx];
__threadfence();
if(rev >= 0)
{
sum += DigitArray[rev];
}
DigitArray[idx] = sum;
}
All very well and good.
Now for the carry.
In a perfect world, I would want the following to happen:
__device__ unsigned int SumScratch[1024*256];
__global__ void Carry()
{
int idx = GlobalThreadIndex();
SumScratch[idx] = 0xF6F6F6F6;
__threadfence();
unsigned int * ptr = (unsigned int *)(DigitArray + (idx * size of(unsigned char));
SumScratch[idx] += *ptr;
__threadfence();
unsigned int cMask = __vcmples(SumScratch[idx], 0x0A0A0A0A);
unsigned int nCMask = ~cMask;
*ptr = __vadd4((SumScratch[idx] & cMask), __vsub4((SumScratch[idx] & nCMask), (OxF6F6F6F6 & nCMask)) & nCMask);
}
In this perfect world, the line
SumScratch[idx] += *ptr;
Would would overflow into the next byte if the most significant byte in *ptr was greater than 9.
That doesn't happen though, so the indicated line can be replaced with:
unsigned int val = *ptr;
unsigned int ret = 0;
unsigned int carryOut = 0;
asm("{"
"add.cc.u32 %0, %2, %3;"
"addc.cc.u32 %1, 0, 0;"
"}"
: "=r"(ret), "=r"(carryOut)
: "r"(val), "r"(OxF6F6F6F6)
);
SumScratch[idx] = 0;
__threadfence();
atomicAdd(&(SumScratch[idx]), ret);
atomicadd(&(SumScratch[idx+1]), carryOut);
Followed by all the simd instructions for masking.
The goal of this being, if you had: (Most Sig to Least)
0x00090401 0x09090909 0x10081204
Then the most significant byte of the least sufficient int, when added to F6 would result in a carry out into the middle int's sums (pist addition to F6... all bytes are FF) resulting in all its bits flipping to 0 and it carrying out into the most significant int.
So, in reality, I want to treat the whole array as though it was just a single binary sequence and allow bits to just flip.
Any thoughts or ideas would be greatly appreciated.