6

What's the fastest way to implement

template <typename T>
unsigned highest_decimal_digit(T x);

(which returns e.g. 3 for 356431, 7 for 71 and 9 for 9)?

The best I can think of is:

  • constexpr-calculate the "middle-size" power of 10 which fits into T.
  • perform a binary search (over the powers of 10, possibly using a constexpr-constructed lookup table) to find p, the highest power-of-10 lower than x.
  • return x divided by p

... but maybe there's another approach.

Notes:

  • I expressed the question and my approach in C++14ish terms, and a solution in code would be nice, but an abstract solution (or even a solution in x86_64 assembly) would be fine. I do want something that will work for all (unsigned) integer types, though.
  • You may ignore signed integral types.
  • I didn't specify what "fast" is, but be hardware-conscious please.
einpoklum
  • 118,144
  • 57
  • 340
  • 684

4 Answers4

2

The best option seems to be to combine CLZ approach and divide by precalculated power of 10. So, in pseudocode:

powers10=[1,1,1,1,10,10,10,10,100,100...]; // contains powers of 10 map to CLZ values
int firstDigit(unsigned INT_TYPE a) {
    if (a<10) return a; // one digit, screw it
    int j=typesize(a)-clz(a);
    if (j==3) return 1; // 10-16 hit this, return 1
    int k=a/powers10[j];
    if (k>9) return 1; else return k;
}

typesize() returns 64 for long long, 32 for int and 16 for short int.

Vesper
  • 18,599
  • 6
  • 39
  • 61
  • That should be really slow, as it requires reading from RAM. If you run it repeatedly it will be somewhat better (L1 hopefully), but I still doubt very much that's the fastest way to do it. – einpoklum Jul 28 '16 at 16:45
  • @einpoklum Well, hardware-wise, it's cheaper to stuff the whole table into L1 than perform calculation of the power of ten that's needed to divide, and dividing is always more costly than multiplying, especially by constant, so dividing by 10 repeatedly will be more expensive than dividing once with one RAM access. Also, the `powers10` can be located on stack, as it's pretty small, a mere 64x8=512 bytes, and the stack is likely located on L1 at whatever system that has L1 cache. – Vesper Jul 29 '16 at 06:22
  • There are alternatives without the repeated division by 10 which do not require reading anything from RAM - e.g. exponentiation of 10, using merely multiplications, which are cheap. As for L1 - you might be able to keep your table there, or you might not. – einpoklum Jul 29 '16 at 08:47
  • If you always call function, isn't the stack always on L1 or not? Also, do you really need to batch process this many numbers to have L1 issues relevant? – Vesper Jul 29 '16 at 08:49
  • Well, if you need to make a function call every time then that's a problem in itself. And if you're in another function which works on some data structure and requires repeat calculations of the highest decimal digit - you might be evicting the array from L1 cache repeatedly. Still, I guess you'll most have it in L1. – einpoklum Jul 29 '16 at 09:19
1

Newer x86 chips support an lzcnt instruction that tells you the number of clear bits at the start of an integer. You can access it using built-in compiler functions such as the following (from GCC):

 unsigned short __builtin_ia32_lzcnt_16(unsigned short);
 unsigned int __builtin_ia32_lzcnt_u32(unsigned int);
 unsigned long long __builtin_ia32_lzcnt_u64 (unsigned long long);

You could combine this with a lookup table of 640 values indicating the lower and upper bounds of integers starting with each digit from 0-9 that start with the corresponding number of clear bits. In fact, you could save space by right-shifting the lzcnt value 3 places; the correspondence with first decimal digits will still be unique.

r3mainer
  • 23,981
  • 3
  • 51
  • 88
1

With an lzcnt instruction, you can construct a table of divisors for each number of leading zero bits. For example, for unsigned 64 bit numbers:

lz | range   | div
---+---------+----
64 |   0     |   1
63 |   1     |   1
62 |   2-3   |   1
61 |   4-7   |   1
60 |   8-15  |   1
59 |  16-31  |  10
58 |  32-63  |  10
57 |  64-127 |  10
56 | 128-255 | 100
...
 0 | 9223372036854775808-18446744073709551615 | 1000000000000000000

Then the computation becomes:

leading_zero_bits = lzcnt(x);
leading_digit = x / divisor_table[leading_zero_bits];
if (leading_digit >= 10) leading_digit = 1;

The result of the division will always be less than 20, so only a simple check is needed for quotients between 10 and 19. The division by a constant can also be optimized.

nwellnhof
  • 32,319
  • 7
  • 89
  • 113
0

Options that occur to me;

Brute force: keep integer-dividing by 10 until you get zero; that tells you what order of number you're looking at (eg 860 takes 3 shifts (86, 8, 0) so it's a 10^3) then return n/(10^order)

Binary search: as you say, search over powers of 10, but it requires extra variables and assignments and the concern would be, does that extra tracking info pay for itself on the kinds of numbers you care about? Eg, if most of your numbers are small, brute force may just be faster.

Bitshift optimisation: count how many times you need to do x >> 1 until you get to zero; this sets the range for your search. E.g., 94 takes 7 shifts to clear the number. Therefore it's < 128. Therefore, start brute-force searching at 10^3. You'll need a lookup for bits=>order.

Steve Cooper
  • 20,542
  • 15
  • 71
  • 88
  • Division is kind of expensive... I would think that at least for 32-bit and 64-bit numbers it should be worth it to do something smarter than. As for "counting shifts" - there's no need for that, we have CLZ in hardware these days. – einpoklum Jul 28 '16 at 10:32
  • I've not really got enough assembly smarts to help. Except to offer rubbish you might be able to salvage ;) – Steve Cooper Jul 28 '16 at 10:33
  • 1
    That said, if you can do `clz`, then convert that number to a the maximum power of ten, you only have to try a small number of tests, I think. I don't know enough to help out more. – Steve Cooper Jul 28 '16 at 10:36
  • Yep, `clz` seems to be the best approach, and then you have a power of 10 ready to divide, at least half of the time, say if `clz` for a 32-bit number returns 27, the number is in 16-32 region, so dividing by 10 gives you the answer. If `clz` is 25, the region is 64-127, so again divide by 10 and compare, if result is above 9, plain return 1. Same with higher powers of 10, but store them first of course. – Vesper Jul 28 '16 at 12:09