14

I want to convert the output from gethrtime() to milliseconds.

The obvious way to do this is to divide by 1000000. However, I'm doing this quite often and wonder if it could become a bottleneck.

Is there an optimized divide operation when dealing with numbers like 1000000?

Note: any code must be portable. I'm using gcc, and this is generally on SPARC hardware.

Some quick testing using the code below... hope that is right.

#include <sys/time.h>
#include <iostream>

using namespace std;

const double NANOSECONDS_TO_MILLISECONDS = 1.0 / 1000000.0;

int main()
{
    hrtime_t start;
    hrtime_t tmp;
    hrtime_t fin;

    start = gethrtime();
    tmp = (hrtime_t)(start * NANOSECONDS_TO_MILLISECONDS);
    fin = gethrtime();

    cout << "Method 1"
    cout << "Original val: " << start << endl;
    cout << "Computed: " << tmp << endl;
    cout << "Time:" << fin - start << endl;

    start = gethrtime();
    tmp = (start / 1000000);
    fin = gethrtime();

    cout "Method 2"    
    cout << "Original val: " << start << endl;
    cout << "Computed: " << tmp << endl;
    cout << "Time:" << fin - start << endl;

    return 0;
}  

Example outputs:

Original val: 3048161553965997
Computed: 3048161553
Time:82082
Original val: 3048161556359586
Computed: 3048161556
Time:31230

Original val: 3048239663018915
Computed: 3048239663
Time:79381
Original val: 3048239665393873
Computed: 3048239665
Time:31321

Original val: 3048249874282285
Computed: 3048249874
Time:81812
Original val: 3048249876664084
Computed: 3048249876
Time:34830

If this is correct, then the multiply-by-reciprocal is actually slower in this case. It's probably due to using floating-point math instead of fixed-point math. I will just stick with integer division then, which still takes hardly any time at all.
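(As the comments point out, a single timed call mostly measures the timer itself; a looped micro-benchmark gives steadier numbers. A minimal sketch, with an arbitrary iteration count:)

#include <sys/time.h>
#include <iostream>

using namespace std;

int main()
{
    const int ITERATIONS = 1000000;
    volatile hrtime_t sink = 0;   // volatile keeps the loop from being optimized away

    hrtime_t base = gethrtime();
    hrtime_t start = gethrtime();
    for (int i = 0; i < ITERATIONS; ++i)
        sink = (base + i) / 1000000;   // vary the input so the divide can't be hoisted
    hrtime_t fin = gethrtime();

    cout << "ns per division (including loop overhead): "
         << (fin - start) / ITERATIONS << endl;

    return 0;
}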

Jonathan Leffler
hookenz
  • How often are you doing this? Googling around was showing low tens to 50 clock cycles per division. With processors in the billions of clock cycles and some Sparc hardware having 64 cores per socket, ... – TheJacobTaylor Aug 13 '09 at 04:21
  • +1 to TheJacobTaylor. It seems like a nearly negligible amount of time. Not to mention the fact that Matt seems to be working on a hunch. Have you tried any sort of profiling to see whether or not this truly is a bottleneck? – Eric Aug 13 '09 at 04:35
  • It's used for message timing in a telecoms project. I guess we're talking around 2000 times a second. – hookenz Aug 13 '09 at 04:55
  • As I mentioned already Matt, I did not realize it was integer based. If I had known that, I would not have suggested conversion. At least it looks like your problem has been solved. – TheJacobTaylor Aug 13 '09 at 04:57
  • Your sample output does not tie up with the code that is purported to produce it - the 'Method N' tags are missing. – Jonathan Leffler Aug 13 '09 at 05:35
  • For timing exercises like this, single iterations are almost meaningless. Plus - what is the gethrtime() function? I don't remember it being a system call. I'd expect to see POSIX's clock_gettime() or gettimeofday() being used. – Jonathan Leffler Aug 13 '09 at 05:37
  • Further investigation shows that gethrtime() returns a 64-bit integer value of the number of nanoseconds since an arbitrary point in time, not subject to adjustments via adjtime() etc. Introducing doubles into the calculation is probably the mistake - as noted. Solaris's gettimeofday() is exceptionally fast - it is a memory access and not a system call (it is many times faster than Solaris's getpid() system call). This gethrtime() is likely to be similar. – Jonathan Leffler Aug 13 '09 at 05:43

10 Answers

54

Let your compiler figure it out!

Seriously, if you're really concerned about optimizations at this level (and you shouldn't be unless it shows up in a profile), you should get used to looking at your compiler's assembly language output. You will be amazed what the compiler is doing on your behalf.

All the people who are recommending math tricks either have bad compilers or they are underestimating their compilers. For example, try compiling this function:

unsigned long div1000000(unsigned long n) {
  return n / 1000000UL;
}

Compiled with gcc 4.3.3 on x86 (-O3, -fomit-frame-pointer), I get:

$ objdump -d div.o -M intel

test2.o:     file format elf32-i386


Disassembly of section .text:

00000000 <div1000000>:
   0:   b8 83 de 1b 43          mov    eax,0x431bde83
   5:   f7 64 24 04             mul    DWORD PTR [esp+0x4]
   9:   c1 ea 12                shr    edx,0x12
   c:   89 d0                   mov    eax,edx
   e:   c3                      ret    

In other words, the compiler took n / 1000000UL and turned it into (unsigned long long)(n * 0x431bde83) >> (0x12 + 32). Why does that work? Off the top of my head, I have no idea! But the compiler decided that it would be faster than issuing a native divide.
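To convince yourself the transformation is exact, you can spot-check it against plain division (a hypothetical check; written as ordinary 64-bit arithmetic, the shr edx,0x12 on the high half of the product amounts to a total shift of 18 + 32 = 50):

#include <iostream>
#include <stdint.h>

int main()
{
    const uint64_t MAGIC = 0x431bde83;   // magic constant from the disassembly above
    const uint32_t samples[] = { 0u, 1u, 999999u, 1000000u, 123456789u, 4294967295u };

    for (unsigned i = 0; i < sizeof(samples) / sizeof(samples[0]); ++i) {
        uint32_t n = samples[i];
        uint32_t fast = (uint32_t)((n * MAGIC) >> 50);   // multiply + shift, no divide
        uint32_t slow = n / 1000000u;
        std::cout << n << " / 1000000 = " << fast
                  << (fast == slow ? " (matches)" : " (MISMATCH)") << std::endl;
    }
    return 0;
}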

Moral of the story:

  • don't optimize this unless you're sure it's a bottleneck.
  • don't do fancy arithmetic (multiplying by the reciprocal, shifts, etc) unless you already know what your compiler is doing and you think you can beat it.
  • benchmark the result -- only leave in a wart like fancy bitmath if you have demonstrated that you've outdone your compiler.
Josh Haberman
  • A small nitpick - the operation is more like: ((long long)(n * 0x431bde83)) >> (18 + 32), since the shift is 18 bits (0x12) on the high 32-bits of the 64-bit multiplication result. – Michael Burr Aug 13 '09 at 07:40
  • Thanks for the correction, I thought I might have written that down too fast. Though I made one small correction to your correction -- the cast would be to "unsigned long long" since the shift is logical, not arithmetic. – Josh Haberman Aug 13 '09 at 07:59
  • Why does this work? n / 1000000 = n * (1/1000000) = n * 1125899907 / (2^50) = (n * 1125899907) >> 50. – MSalters Sep 02 '09 at 09:25
  • It works because `(1<<50)` is 1,125,899,906,842,624 in decimal, which is just a bit less than 1125899907 multiplied by 1000000. So the net result of multiplying by the smaller constant and then dividing by the larger constant is to effectively divide by 1000000. See Potatoswatter's answer for a more generic explanation. – Dave Goodell Mar 04 '11 at 16:49
34

Division is not an expensive operation. I doubt very much that a divide-by-1000000 operation will be anywhere near the main bottleneck in your application. The processor will do the single divide operation faster than any sort of "tricks" you can come up with.

Robert Cartaino
  • Outputting the result will be much more expensive, as it is very likely to imply a system call (write()). – Ben Aug 13 '09 at 05:03
  • Agreed. Premature optimization is the root of all evil. And until you've demonstrated by measurement that one division is the source of your application's performance woes, don't bother with the optimization. – Jonathan Leffler Aug 13 '09 at 05:30
  • I could however imagine a situation where this could be a performance bottleneck: if you're measuring a certain high-frequency input and have to register the time yourself. If you were using QueryPerformanceCounter, you might want to translate that to a correct value as fast as possible. – Davy Landman Aug 13 '09 at 07:44
  • Unless you are running on a Niagara processor, then ... it is of course a bottleneck. And FP is slow regardless. – cb88 Jun 11 '13 at 16:16
15

I'm surprised nobody has gotten this yet…

  • division is the same as multiplication by a fraction
  • multiplying by a fractional power of 2 is fast: just bit-shift
  • integral division involves rounding down
  • rounding down is like multiplying by a slightly smaller fraction (up to a certain point, you need to be aware of your ranges)

So,

const uint64_t numerator = (1LL<<32)/1000000;

...

millionths = ( number * numerator ) >> 32;

Supah fast!
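For completeness, a self-contained sketch of this (sample value borrowed from the question's output). Note that (1LL<<32)/1000000 truncates to 4294, a fraction slightly below 1/1000000, so with a full 64-bit gethrtime() value the result drifts visibly low - the "be aware of your ranges" caveat above:

#include <iostream>
#include <stdint.h>

int main()
{
    const uint64_t numerator = (1ULL << 32) / 1000000;   // truncates to 4294

    uint64_t number = 3048161553965997ULL;   // sample gethrtime() value from the question
    uint64_t millionths = (number * numerator) >> 32;

    // The truncated constant is ~0.02% below 1/1000000, so at this
    // magnitude the result lands visibly below the exact quotient.
    std::cout << millionths << " vs exact " << number / 1000000 << std::endl;
    return 0;
}

Rounding the constant up instead makes the result overshoot slightly; see e.James's answer and its comments below.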

Potatoswatter
  • I thought for fun I'd try this. Unfortunately gcc spits out the following error: "warning: left shift count >= width of type", therefore the computed value is 0. I'll stick with integer divide since it's not that slow. – hookenz Aug 13 '09 at 05:09
  • Sorry, it needs to be 1LL instead of just 1, so the literal isn't a 32-bit int. – Potatoswatter Aug 13 '09 at 05:35
  • I think the compiler in Josh Haberman's answer did basically the same thing. – ilya n. Aug 14 '09 at 00:04
3

Multiply by 1/1,000,000. It should be faster. My Google search says that to speed up divisions, you multiply by the reciprocal. So I would pre-calculate the reciprocal, or a list of reciprocals if there is a relatively known set of possible values, and then multiply.

Jacob

TheJacobTaylor
  • I don't think he is using floating points, so he cannot use reciprocals. 1/1,000,000 would be 0. – Unknown Aug 13 '09 at 04:27
  • Converting an integer divide operation into a floating point multiply operation would probably not be faster, and would introduce the inaccuracy of floating point math. – scwagner Aug 13 '09 at 04:28
  • Dividing by huge numbers with integer math just feels like asking for pain. I get back to the "why" question above. Anyway, if you cannot convert the value on "display" and instead have to do it in realtime, an approximation might work. Just call it Milliseconds (notice capital M, like MegaBytes). Bit shift the integer to the right by 20 to divide by 1,048,576. – TheJacobTaylor Aug 13 '09 at 04:31
  • Quick google shows you are right... both return an hrtime_t, which is a 64-bit (long long) signed integer (from Solaris docs). That even makes the bit shift approximation not so cool. – TheJacobTaylor Aug 13 '09 at 04:35
  • You can do reciprocals using integer math. The GCC assembly shows how. The basic trick is not to multiply by 1/N, but by (x/2^y). The "/2^y" part is fast because it's merely a bitshift. – MSalters Sep 02 '09 at 09:28
3

However, I'm doing this quite often and wonder if it could become a bottleneck.

First things first. If you think this will be a bottleneck, profile the code in question and find out for sure.

If, (and only if) this is your bottleneck, then work on improving it.

Now, on to your improvement options:

1. You may not need to convert to milliseconds right away. If you are simply collecting data, just store the full 64-bit number returned from gethrtime() and be done with it. Anything that a human needs to read can be post-processed at a later time, or on a much less aggressive update frequency.

2. If you are timing some repetitive event, you can try performing the division on the difference between two calls, which should be very small if you are calling gethrtime() often enough to have a bottleneck:

static hrtime_t oldtime;
hrtime_t newtime = gethrtime();
int milliseconds = fastDivByOneMillion((UI32)(newtime - oldtime));
oldtime = newtime;

3. You can implement fastDivByOneMillion() as a multiplication and a division by a power of 2:

int fastDivByOneMillion(UI32 nanoseconds)
{
    return (int)((UI64)nanoseconds * 4295 >> 32);
}

Notes:

  • Your compiler can figure out the best way to do >> 32 on your hardware. Most of the time, this will be only one or two clock cycles.
  • I used UI32 and UI64 to represent 32 and 64-bit unsigned numbers.
  • All of this will require more profiling to be sure that it is actually producing a measurable improvement.
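Since 4295 is 2^32/10^6 rounded up, the shortcut slightly overestimates; a quick hypothetical brute-force check shows where it first disagrees with plain division (the first mismatch turns up just below 10^6, which may explain the comment below):

#include <iostream>
#include <stdint.h>

typedef uint32_t UI32;
typedef uint64_t UI64;

int fastDivByOneMillion(UI32 nanoseconds)
{
    return (int)((UI64)nanoseconds * 4295 >> 32);
}

int main()
{
    // Scan upward until the shortcut first disagrees with real division.
    for (UI32 n = 0; n != 0xFFFFFFFFu; ++n) {
        if (fastDivByOneMillion(n) != (int)(n / 1000000u)) {
            std::cout << "first mismatch at n = " << n << std::endl;
            return 0;
        }
    }
    std::cout << "exact over the full range" << std::endl;
    return 0;
}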
e.James
  • This actually did not compute the correct result. Also, I would want to avoid function call overhead here... I'll stick with integer division for now. – hookenz Aug 13 '09 at 05:12
  • Hmm, that's strange. `4295 / 2^32` works out to (approximately) 10^-6. Were you using the difference (which should be small enough to avoid the risk of overflow) or the full amount? Either way, you probably have the right idea just using integer division :) – e.James Aug 13 '09 at 05:25
2

As Joshua Haberman mentioned, your compiler will probably already convert the division by a constant 1000000 into a multiply by a 'magic number' followed by a shift (if the division is an integer operation). You can get more details of what's going on in Henry Warren's "Hacker's Delight" book and at its companion website, which even has a page with a Javascript calculator for the magic numbers.

Michael Burr
2

First, the obvious disclaimer: unless you perform the division a couple of million times per second at least, it won't be a bottleneck, and you should just leave it. Premature optimization and all that.

Second, how accurate do you need the result to be? A handy rule of thumb for converting between binary and decimal is that 2^10 ~= 10^3.

In other words, a million is roughly equal to 2^20, so you could just right-shift by 20. The compiler won't do this for you automatically, of course, because it changes the result. But if you're willing to live with the slight inaccuracy, and the division is actually a real performance problem, this would be my suggestion, as sketched below.
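A minimal sketch of the trade-off (sample value borrowed from the question's output; since 2^20 = 1048576 rather than 1000000, the shifted result comes out about 4.6% low):

#include <iostream>
#include <stdint.h>

int main()
{
    uint64_t ns = 3048161553965997ULL;   // sample gethrtime() value from the question

    uint64_t exact  = ns / 1000000;      // true milliseconds
    uint64_t approx = ns >> 20;          // "milliseconds" of 1048576 ns each

    // approx/exact ~= 1000000/1048576 ~= 0.9537, i.e. roughly 4.6% low
    std::cout << exact << " vs " << approx << std::endl;
    return 0;
}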

jalf
0

It's possible to transform integer division into a series of simpler operations. A generic method for doing this, popularized by Terje Mathisen, is outlined on page 136 of Optimizing Subroutines in Assembly Language. If you know in advance the width of your data types and what you're dividing by, it leads you through turning the division into a series of simpler operations that in theory might be faster than a more generic division operation that has to handle any divisor. There will still be some platform issues to be concerned about if you're worried about differently sized integers on some platforms.

Unless you're actually programming this in assembly language, I'd bet against you improving anything over the SPARC divide implementation. Maybe if you're using a positively ancient SPARC V7 processor, from before division was implemented in hardware, you might see some improvement, but even then I'd bet on the built-in division being faster.

Regardless, I suspect you've launched into a bit of premature optimization here. You should start by profiling the app you've got before presuming this division has any significant impact on its runtime, and you should similarly profile any change to the division to prove it works as expected. It's quite easy to write code that you think will execute faster but actually doesn't nowadays, given how complicated things like CPU caches have gotten.

Greg Smith
0

If you can live with an approximation, here's my solution:

  • use integers instead of floats (they are faster)
  • divide by 1048576 by shifting bits to the right (that is cheaper than anything on floats)

and convince yourself that milliseconds should be base 2 and not base 10. ;-)

Marcin
0

1/1000000 is 0.000000000000000000 0100 0011 0001 1011 1101 1110 1000 0010 1101 0111 1011 0110 0011 01... in binary - that is, 18 leading zero bits followed by the 32 bits of 0x431BDE82, so it is approximately 0x431BDE82 * 2^-50.

Therefore n/1000000 is approximately (n*0x431BDE82)>>50

Also n/1000000 is approximately (n*0x8637BD04)>>51

Note that this is a "fixed point" calculation and you should be aware that precision could be lost.
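A quick hypothetical check of the identity (the truncated constant already misses at exact multiples of 1000000, while the rounded-up constant gcc emits in Josh Haberman's answer does not):

#include <iostream>
#include <stdint.h>

int main()
{
    const uint64_t TRUNC = 0x431BDE82;   // truncated constant, as derived above
    const uint64_t ROUND = 0x431BDE83;   // rounded up, as gcc emits

    const uint64_t samples[] = { 999999, 1000000, 1000001, 123456789 };
    for (unsigned i = 0; i < sizeof(samples) / sizeof(samples[0]); ++i) {
        uint64_t n = samples[i];
        std::cout << n << ": truncated=" << ((n * TRUNC) >> 50)
                  << " rounded=" << ((n * ROUND) >> 50)
                  << " exact=" << n / 1000000 << std::endl;
    }
    return 0;
}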

teambob