24

In my program I use a lot of integer division by 10^x and integer modulo by powers of 10.

For example:

unsigned __int64 a = 12345;
a = a / 100;
....

or:

unsigned __int64 a = 12345;
a = a % 1000;
....

If I use the right bit shift >>, then I get division/modulo by powers of 2 (2^x), which is not what I want.

Is there any way I can speed up these integer division and modulo operations?

phuclv
kirbo
  • Have you run a performance profiler to ensure that the problem is here? – Mark Byers Jan 09 '10 at 11:51
  • Why does your algorithm have such a large dependency on the number of fingers humans have? If humans started to get 12 fingers, then you'd have to change your entire algorithm. Is it possible to rewrite your algorithm so that this dependency is for input and output only, meaning the rest of the algorithm would work for creatures of any species? It would probably also be faster. – Mark Byers Jan 09 '10 at 11:56
  • So long as we have grams and Kgm, cm and m, dimes and dollars we'll have algorithms that do lots of base 10 arithmetic. – djna Jan 09 '10 at 12:04
  • @djna: Convert the units to their base units, perform the algorithm, then convert the result to your desired output unit. Then you only have two places where you work in base 10 - input and output. The entire rest of the algorithm can work with the base units. – Mark Byers Jan 09 '10 at 12:07
  • We don't know what unknown is trying to do; conceivably he's actually trying to speed up that final conversion from base units (cents) back to dollars and cents. My point is that having base 10 computations in an application is not evidence of unreasonable dependency on the number of fingers. – djna Jan 09 '10 at 12:11
  • Having base 10 computations in an application is indeed OK, *as long as it is restricted to input and output*. If you have a lot of "integer mod function of power 10", for handling things like weights, lengths or currency, that's a code smell. – Mark Byers Jan 09 '10 at 12:20
  • Oh, and I still long for the days of pounds, shillings and pence, not to mention farthings. Now back to working on that converter for those 112lb "hundredweights". – djna Jan 09 '10 at 12:22
  • It hasn't occurred to you that if there is a significantly faster way to divide by powers of ten, **that the compiler might already do it for you**? As long as the right hand operand is a compile-time constant, the compiler knows perfectly well that it is a power of ten, and will do what it can to speed up the process. – jalf Jan 09 '10 at 14:50
  • With jalf here: no matter what you do, the compiler will already have the optimum solution; the best you can do is equal it. Also, your solution will not be portable, while the compiler version just needs a re-compile for the next platform. If by some miracle you find an optimization, please report it to the developers of the gcc backend and the next version will have it (thus making your hack useless). – Martin York Jan 09 '10 at 16:46
  • @MarkByers: _"If humans started to get 12 fingers"_ Nice try, troll....?! – Lightness Races in Orbit Feb 20 '15 at 12:05
  • I'm 12 years late... but you people do know that all standard integer and floating-point types are treated as binary, right? There's no such thing as "convert to base 10" or whatever. There's still a need to do math with the number 10, but in binary there's nothing special about 10. There's no reason math involving 10 would be faster than math involving any other number. – DeltA Nov 13 '22 at 21:08

10 Answers

30

Short Answer: NO

Long Answer: NO.

Explanation:
The compiler is already optimizing statements like this for you.
If there is a technique for implementing this more quickly than an integer division, then the compiler already knows about it and will apply it (assuming you turn on optimizations).

If you also provide the appropriate architecture flags, the compiler may even know about fast architecture-specific instructions that provide a nice trick for the operation; otherwise it will apply the best trick for the generic architecture it is compiling for.

In short, the compiler will beat a human 99.9999999% of the time at this kind of optimization trick (try it, but remember to add the optimization and architecture flags). So the best you can normally do is equal the compiler.

If by some miracle you discover a method that has not already been found by the assembly boffins who work closely with the backend compiler team, then please let them know, and the next version of the popular compilers will be updated with the 'unknown (google)' division-by-10 optimization trick.
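
A quick way to convince yourself of this is to compile a small function with optimizations and inspect the generated assembly. A minimal sketch (the file and function names are just for illustration; the exact instruction sequence depends on compiler and target):

// div10_check.cpp -- build with e.g. `g++ -O2 -S div10_check.cpp` and read the
// .s output: for a constant divisor the compiler typically emits a
// multiply-by-reciprocal sequence rather than a div instruction.
#include <cstdint>

std::uint64_t div100(std::uint64_t a)  { return a / 100; }
std::uint64_t mod1000(std::uint64_t a) { return a % 1000; }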

Martin York
  • I love this answer. Just the right combination of snarkiness and correctness. – jason Jan 09 '10 at 17:05
  • Is interesting to know so many people based their decisions on faith. – rxantos Mar 31 '14 at 04:33
  • @rxantos: Is interesting how people make assumptions. – Martin York Mar 31 '14 at 05:41
  • @LokiAstari It would be interesting to see some evidence to back up all the baseless assertions on SO. 99.9999999% huh? Do you even code bro? – chili Jul 04 '16 at 03:28
  • @chili: Are you trying to assert that humans are better at peephole optimizations than the compiler? Sure the 99.999... is just made up; the actual number is irrelevant (unless I am completely wrong about humans and compilers). Compilers are better at this. Concentrate your effort on what humans are good at (algorithms). – Martin York Jul 04 '16 at 06:14
  • @chili: Or is it that this specific case is faster using stacker's `x = divu10(x)` than the plain `x = x/10`? Let's assume it is. Does that make it good advice to use `divu10()` over the normal, more readable technique? I don't think so. The compiler can be trained to use this technique, and if a faster technique is found it can be trained to use that. So by using `x = x/10` your code becomes better as the compiler becomes better (without any maintenance). – Martin York Jul 04 '16 at 07:25
  • @MartinYork I can see where you're going with this argument and in many ways I agree and think what you're saying is right. However, there are many times when the handcrafted asm I write crucifies even a good compiler, because of the assumptions it has to make. It all depends on how well you know your assembly in terms of latency, throughput and retire times. With simple maths like the above, it's likely to favour the compiler, so I agree with you there. But if I need an optimization, I may not have the luxury of waiting for the compiler boffins to build it into their compilers. – The Welder Dec 18 '19 at 14:27
15

From http://www.hackersdelight.org/divcMore.pdf

unsigned divu10(unsigned n) {
    unsigned q, r;
    q = (n >> 1) + (n >> 2);     // q ~= 0.75 * n
    q = q + (q >> 4);            // refine the estimate towards 0.8 * n
    q = q + (q >> 8);
    q = q + (q >> 16);
    q = q >> 3;                  // q ~= n / 10, possibly one too small
    r = n - q * 10;              // remainder of the estimate
    return q + ((r + 6) >> 4);   // add 1 when the estimate was one short (r >= 10)
}
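
Whether this beats a plain `/` is entirely architecture-dependent, but checking that it at least matches ordinary division is cheap. A minimal sketch, assuming the divu10 above is in scope:

#include <cassert>

unsigned divu10(unsigned n);   // the routine quoted above

int main() {
    for (unsigned n = 0; n < 10000000u; ++n)
        assert(divu10(n) == n / 10);   // should hold for all 32-bit unsigned values
    return 0;
}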
stacker
  • +1 for actual answer (-dozen to everyone who uses it). Perhaps, the function should be declared as `inline`. – Li0liQ Jan 09 '10 at 11:56
  • To the people upvoting this, has anyone performance profiled it to make sure that it is actually faster? I cannot see anywhere that the article claims that the code is faster. – Mark Byers Jan 09 '10 at 12:03
  • Seven shifts, one multiplication, seven add/sub, and it looks like 3 registers; sounds like a close call for speed. Probably depends on architecture and the context in which it's used. – James Jan 09 '10 at 12:13
  • Can't most CPUs do a div in a single clock (possibly pipelined)? I would downvote this answer and encourage the OP to just use /. – keraba Jan 09 '10 at 14:32
  • As far as I know, integer division is typically *not* pipelined, and takes upwards of 15 cycles. That's just from memory though, don't quote me on it. But integer division is fairly slow. However, I'm not saying that this answer's algorithm is faster. It'll use more registers, that's for sure. And take up a lot more icache space. – jalf Jan 09 '10 at 14:52
  • If this were faster, the compiler would likely implement the division with this method. – Mehrdad Afshari Jan 09 '10 at 17:09
  • It's an alternative; whether or not it is faster depends on the processor used. Think about embedded systems with weak processors instead of the latest Intel or AMD processors. – stacker Jan 09 '10 at 17:16
  • stacker: I would say it might work for systems with "weak compilers". To be fair, embedded systems usually have not so great compilers. – Mehrdad Afshari Jan 09 '10 at 17:19
  • @keraba: Downvote the answer? You *downvote* answers in such cases? The answer is not incorrect, given the right situation. It might not deserve an upvote, but it certainly doesn't deserve a downvote. – AnT stands with Russia Jan 09 '10 at 17:20
  • @AndreyT: It might not be technically incorrect, but it's certainly incomplete. – jason Jan 09 '10 at 17:37
  • Am I the only one here that thinks about the possibilities with SSE/AVX? These things don't have a 'div' operator, but they do have all the ones used here. – atlaste Feb 16 '15 at 09:00
  • A little late to the party, but of course this kind of thing is really relevant for embedded systems. It can easily take hundreds of clock cycles for an arbitrary integer division on an architecture (e.g. the popular Cortex M0/+) that lacks fast division support. – Adam Brown Jan 22 '16 at 05:02
  • On modern processors from Intel and AMD the above code is not necessarily faster (high chances that it is not), but that's because of the availability of a powerful math coprocessor. There are a whole bunch of embedded systems where a coprocessor is simply not available. For example, on the Arduino forum (https://forum.arduino.cc/index.php?topic=167414.0) people enhanced the algorithm from the mentioned book and got very impressive results, far exceeding the performance of compiler-generated code. – Noob Dec 17 '19 at 21:05
  • @keraba No, most CPUs cannot do a division in a single clock; in fact, most can't get even close to doing it in one clock. The x86 takes >30 clks to do an integer division and pipelining doesn't buy you much, which is generally why it isn't pipelined. Pipelining just ends up in a cataclysm of stalls on the ALUs, so you can never really get much further with this approach. In fact, I code assembly on quite a few architectures and I've never ever come across one that can do a DIV/IDIV in anything close to a single clock. – The Welder Dec 18 '19 at 14:43
7

This is great for environments that lack any div instruction, and it's only ~2x slower than native division on my i7 (optimizations off, naturally).

Here's a slightly faster version of the algorithm, though there are still some nasty rounding errors with negative numbers.

static signed Div10(signed n)
{
    n = (n >> 1) + (n >> 2);
    n += n < 0 ? 9 : 2;
    n = n + (n >> 4);
    n = n + (n >> 8);
    n = n + (n >> 16);
    n = n >> 3;
    return n;
}

Since this method is for 32-bit integer precision, you can optimize away most of these shifts if you're working in an 8-bit or 16-bit environment.
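
For illustration only, here is a hedged sketch of what that reduction could look like for 8-bit unsigned values. It is based on the exact divu10 quoted from Hacker's Delight in the answer above (including its remainder fix-up), not on the signed routine here; for inputs that fit in 8 bits the q >> 8 and q >> 16 refinement steps contribute nothing and can simply be dropped:

// Hedged sketch, not from the original answer: exact division by 10 for 8-bit
// unsigned values, reduced from the 32-bit Hacker's Delight divu10.
static unsigned char divu10_u8(unsigned char n)
{
    unsigned q = (n >> 1) + (n >> 2);  // ~0.75 * n
    q += q >> 4;                       // ~0.8 * n; further refinement terms are 0 for 8-bit n
    q >>= 3;                           // ~n / 10, possibly one too small
    unsigned r = n - q * 10;           // remainder of the estimate (0..19)
    return (unsigned char)(q + ((r + 6) >> 4));  // add 1 when the estimate was one short
}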

user563910
  • Your function is simply wrong. Invoke it with n=8 and you get 1, but correct result is 0 – Noob Dec 17 '19 at 21:35
6

On a different note, it might make more sense to just write a proper version of Div#n# in assembler. Compilers can't always generate the end result as efficiently (though in most cases they do it rather well), so if you're running in a low-level microchip environment, consider a hand-written asm routine.

#define BitWise_Div10(result, n) {      \
    /*;n = (n >> 1) + (n >> 2);*/           \
    __asm   mov     eax, dword ptr[n]       \
    __asm   mov     ecx, eax                \
    __asm   sar     eax,1                   \
    __asm   sar     ecx,2                   \
    __asm   add     ecx,eax                 \
    /*;n += n < 0 ? 9 : 2;*/                \
    /*; use mov, not xor, so the sign flag from the add above survives */ \
    __asm   mov     eax,0                   \
    __asm   setns   al                      \
    __asm   dec     eax                     \
    __asm   and     eax,7                   \
    __asm   add     eax,2                   \
    __asm   add     ecx,eax                 \
    /*;n = n + (n >> 4);*/                  \
    __asm   mov     eax,ecx                 \
    __asm   sar     eax,4                   \
    __asm   add     ecx,eax                 \
    /*;n = n + (n >> 8);*/                  \
    __asm   mov     eax,ecx                 \
    __asm   sar     eax,8                   \
    __asm   add     ecx,eax                 \
    /*;n = n + (n >> 16);*/                 \
    __asm   mov     eax,ecx                 \
    __asm   sar     eax,10h                 \
    __asm   add     eax,ecx                 \
    /*;return n >> 3;}*/                    \
    __asm   sar     eax,3                   \
    __asm   mov     dword ptr[result], eax  \
}

Usage:

int x = 12399;
int r;
BitWise_Div10(r, x); // r = x / 10
// r == 1239

Again, just a note. This is better used on chips that indeed have really bad division. On modern processors and modern compilers, divisions are often optimized out in very clever ways.
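
As a hedged aside (not part of the original answer): the macro above uses MSVC-style x86 inline __asm, so on other toolchains you would want a plain-C fallback, which also makes it easy to compare the two:

#if !defined(_MSC_VER)
/* Fallback so the same call sites compile everywhere; the compiler's own
   constant-division optimization applies here. */
#define BitWise_Div10(result, n)  do { (result) = (n) / 10; } while (0)
#endif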

user563910
3

Short Answer: THAT DEPENDS.

Long Answer:

Yes, it is very possible IF you can use things that the compiler cannot automatically deduce. However, in my experience this is quite rare; most compilers are pretty good at vectorizing nowadays. Much also depends on how you model your data and how willing you are to create incredibly complex code. For most users, I wouldn't recommend going through the trouble in the first place.

To give you an example, here's the implementation of x / 10 where x is a signed integer (this is actually what the compiler will generate):

int edx = (int)(((long long)value * 0x66666667) >> 32);  // high 32 bits of the 64-bit product
edx = edx >> 2;                                          // NOTE: use an arithmetic shift here!
int result = ((unsigned)edx >> 31) + edx;                // add the sign bit to truncate toward zero

If you disassemble your compiled C++ code and you used a constant for the '10', you will see assembly reflecting the above. If you didn't use a constant, it'll generate an idiv, which is much slower.

Knowing that your memory is aligned, or that your code can be vectorized, can be very beneficial. Do note that this requires you to store your data in a way that makes this possible.

For example, if you want to calculate the sum-of-div/10's of all integers, you can do something like this:

    __m256i ctr = _mm256_set_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    ctr = _mm256_add_epi32(_mm256_set1_epi32(INT32_MIN), ctr);

    __m256i sumdiv = _mm256_set1_epi32(0);
    const __m256i magic = _mm256_set1_epi32(0x66666667);
    const int shift = 2;

    // Show that this is correct:
    for (long long int i = INT32_MIN; i <= INT32_MAX; i += 8)
    {
        // Compute the overflow values
        __m256i ovf1 = _mm256_srli_epi64(_mm256_mul_epi32(ctr, magic), 32);
        __m256i ovf2 = _mm256_mul_epi32(_mm256_srli_epi64(ctr, 32), magic);

        // blend the overflows together again
        __m256i rem = _mm256_srai_epi32(_mm256_blend_epi32(ovf1, ovf2, 0xAA), shift);

        // calculate the div value
        __m256i div = _mm256_add_epi32(rem, _mm256_srli_epi32(rem, 31));

        // do something with the result; increment the counter
        sumdiv = _mm256_add_epi32(sumdiv, div);
        ctr = _mm256_add_epi32(ctr, _mm256_set1_epi32(8));
    }

    int sum = 0;
    for (int i = 0; i < 8; ++i) { sum += sumdiv.m256i_i32[i]; }
    std::cout << sum << std::endl;

If you benchmark both implementations, you will find that on an Intel Haswell processor, you'll get these results:

  • idiv: 1.4 GB/s
  • compiler optimized: 4 GB/s
  • AVX2 instructions: 16 GB/s

For other powers of 10 and unsigned division, I recommend reading the paper.

atlaste
3

You can also take a look at the libdivide project. It is designed to speed up integer division in the general case, where the divisor is only known at run time.
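
Roughly, usage looks like the sketch below, written from memory; check the project's README for the exact current API:

#include <cstddef>
#include <cstdint>
#include "libdivide.h"

std::uint64_t sum_div10(const std::uint64_t* data, std::size_t count) {
    libdivide::divider<std::uint64_t> d(10);   // precompute the reciprocal once
    std::uint64_t sum = 0;
    for (std::size_t i = 0; i < count; ++i)
        sum += data[i] / d;                    // overloaded operator/ uses multiply + shift
    return sum;
}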

Sylvain Defresne
2

Not unless your architecture supports Binary Coded Decimal, and even then only with lots of assembly messiness.

James
1

In fact, you don't need to do anything. The compiler is smart enough to optimize multiplications and divisions by constants. You can find many examples here.

You can even do a fast divide by 5 and then shift right by 1: for unsigned values, (x / 5) >> 1 equals x / 10.
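
A minimal sketch of that identity, using the unsigned 64-bit type from the question (the helper name is just for illustration; floor(floor(x / 5) / 2) equals floor(x / 10) for unsigned x):

unsigned __int64 div10(unsigned __int64 x) {
    return (x / 5) >> 1;   // the compiler turns / 5 into a multiply-by-reciprocal as well
}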

phuclv
1

If the divisor is an explicit compile-time constant (i.e. if your x in 10^x is a compile-time constant), there's absolutely no point in using anything other than the language-provided / and % operators. If there were a meaningful way to speed them up for explicit powers of 10, any self-respecting compiler would know how to do that and would do it for you.

The only situation in which you might think about a "custom" implementation (aside from a dumb compiler) is when x is a run-time value. In that case you'd need some kind of decimal-shift and decimal-and analog. On a binary machine a speedup is probably possible, but I doubt you'll be able to achieve anything practically meaningful. (If the numbers were stored in binary-coded-decimal format it would be easy, but in "normal" cases - no.)
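
For the run-time-x case, the straightforward portable approach is simply to look the divisor up and pay for one real division. A minimal sketch (names are hypothetical):

#include <cstdint>

// Valid for x <= 19, since 10^19 is the largest power of ten that fits in 64 bits.
static std::uint64_t pow10_u64(unsigned x) {
    std::uint64_t p = 1;
    while (x--) p *= 10;
    return p;
}

std::uint64_t div_pow10(std::uint64_t a, unsigned x) { return a / pow10_u64(x); }
std::uint64_t mod_pow10(std::uint64_t a, unsigned x) { return a % pow10_u64(x); }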

Eran
AnT stands with Russia
0

If your runtime is genuinely dominated by 10^x-related operations, you could just use a base-10 integer representation in the first place.

In most situations, I'd expect the slowdown of all other integer operations (and the reduced precision or potentially extra memory use) to count for more than the faster 10^x operations.
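
As a hedged illustration of the idea (not from the original answer), a toy digit-array representation makes / 10 and % 10 trivial at the cost of everything else:

#include <cstdint>
#include <cstring>

// Value stored least-significant digit first; 20 decimal digits cover any uint64_t.
struct Dec64 { std::uint8_t digits[20]; };

inline unsigned mod10(const Dec64& v) { return v.digits[0]; }

inline void div10_in_place(Dec64& v) {            // drop the lowest digit
    std::memmove(v.digits, v.digits + 1, sizeof v.digits - 1);
    v.digits[sizeof v.digits - 1] = 0;
}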

Useless