Yes, there's another way, originally invented (at least AFAIK) by Terje Mathiesen. Instead of dividing by 10, you (sort of) multiply by the reciprocal. The trick, of course, is that in integers you can't represent the reciprocal directly. To make up for that, you work with scaled integers. If we had floating point, we could extract digits with something like:
input = 123
first digit = integer(10 * (fraction(input * .1))
second digit = integer(100 * (fraction(input * .01))
...and so on for as many digits as needed. To do this with integers, we basically just scale those by 232 (and round each up, since we'll use truncating math). In C, the algorithm looks like this:
#include <stdio.h>
// here are our scaled factors
static const unsigned long long factors[] = {
3435973837, // ceil((0.1 * 2**32)<<3)
2748779070, // ceil((0.01 * 2**32)<<6)
2199023256, // etc.
3518437209,
2814749768,
2251799814,
3602879702,
2882303762,
2305843010
};
static const char shifts[] = {
3, // the shift value used for each factor above
6,
9,
13,
16,
19,
23,
26,
29
};
int main() {
unsigned input = 13754;
for (int i=8; i!=-1; i--) {
unsigned long long inter = input * factors[i];
inter >>= shifts[i];
inter &= (unsigned)-1;
inter *= 10;
inter >>= 32;
printf("%u", inter);
}
return 0;
}
The operations in the loop will map directly to instructions on most 32-bit processors. Your typical multiply instruction will take 2 32-bit inputs, and produce a 64-bit result, which is exactly what we need here. It'll typically be quite a bit faster than a division instruction as well. In a typical case, some of the operations will (or at least with some care, can) disappear in assembly language. For example, where I've done the inter &= (unsigned)-1;
, in assembly language you'll normally be able to just use the lower 32-bit register where the result was stored, and just ignore whatever holds the upper 32 bits. Likewise, the inter >>= 32;
just means we use the value in the upper 32-bit register, and ignore the lower 32-bit register.
For example, in x86 assembly language, this comes out something like:
mov ebx, 9 ; maximum digits we can deal with.
mov esi, offset output_buffer
next_digit:
mov eax, input
mul factors[ebx*4]
mov cl, shifts[ebx]
shrd eax, edx, cl
mov edx, 10 ; overwrite edx => inter &= (unsigned)-1
mul edx
add dl, '0'
mov [esi], dl ; effectively shift right 32 bits by ignoring 32 LSBs in eax
inc esi
dec ebx
jnz next_digit
mov [esi], bl ; zero terminate the string
For the moment, I've cheated a tiny bit, and written the code assuming an extra item at the beginning of each table (factors
and shifts
). This isn't strictly necessary, but simplifies the code at the cost of wasting 8 bytes of data. It's pretty easy to do away with that too, but I haven't bothered for the moment.
In any case, doing away with the division makes this a fair amount faster on quite a few low- to mid-range processors that lack dedicated division hardware.