53

Our C++ library currently uses time_t for storing time values. I'm beginning to need sub-second precision in some places, so a larger data type will be necessary there anyway. Also, it might be useful to get around the Year-2038 problem in some places. So I'm thinking about completely switching to a single Time class with an underlying int64_t value, to replace the time_t value in all places.
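For illustration, this is roughly the kind of wrapper I have in mind (just a sketch to show the idea; the names are placeholders, not a finished design):

#include <stdint.h>

// Sketch only: nanoseconds since the epoch, stored in a signed 64-bit value.
class Time {
public:
    explicit Time(int64_t ns = 0) : ns_(ns) {}
    int64_t nanoseconds() const { return ns_; }

    Time& operator+=(Time other) { ns_ += other.ns_; return *this; }
    Time& operator-=(Time other) { ns_ -= other.ns_; return *this; }
    bool operator<(Time other) const { return ns_ < other.ns_; }

private:
    int64_t ns_;   // replaces the time_t representation everywhere
};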

Now I'm wondering about the performance impact of such a change when running this code on a 32-bit operating system or 32-bit CPU. IIUC the compiler will generate code to perform 64-bit arithmetic using 32-bit registers. But if this is too slow, I might have to use a more differentiated way for dealing with time values, which might make the software more difficult to maintain.

What I'm interested in:

  • which factors influence performance of these operations? Probably the compiler and compiler version; but does the operating system or the CPU make/model influence this as well? Will a normal 32-bit system use the 64-bit registers of modern CPUs?
  • which operations will be especially slow when emulated on 32-bit? Or which will have nearly no slowdown?
  • are there any existing benchmark results for using int64_t/uint64_t on 32-bit systems?
  • does anyone have first-hand experience with this performance impact?

I'm mostly interested in g++ 4.1 and 4.4 on Linux 2.6 (RHEL5, RHEL6) on Intel Core 2 systems; but it would also be nice to know about the situation for other systems (like Sparc Solaris + Solaris CC, Windows + MSVC).

oliver
    Only careful profiling could tell one way or the other. – Sergey Kalinichenko May 30 '13 at 16:30
  • write two small examples, compile them and compare the asm code. I believe this might fall below the detection threshold of a profiler tool, and comparing the asm code is the best way to go. – andre May 30 '13 at 16:34
  • Is the time processing code the perf bottleneck? – David Heffernan May 30 '13 at 19:50
  • @andre If the difference cannot be timed, why do you care what the asm looks like? In fact, why do you ever care about the asm? All you care about is how long anything takes to run. – David Heffernan May 30 '13 at 19:51
  • @DavidHeffernan You would care because the asm would show which instructions you're running. – andre May 30 '13 at 21:13
  • @andre that would be a different question, but since this is about performance it's not pertinent. What matters is timings. – David Heffernan May 30 '13 at 21:31
  • Adding to David H and @andre: On modern systems, just looking at which instructions are used is not enough to decide what the timing of the code is. You may well find that instruction sequences that look equal (have the same number of the same instructions, just different registers being used) run at very different speeds - for example because one depends on the result of a previous operation and another doesn't. Or cache hits/misses affect the result, or some other similar factor. – Mats Petersson May 30 '13 at 22:29
  • Have you considered using a double? If you just use it for storing integers, it gives you in effect a 53-bit integer, which is a considerable improvement over the 32 bits you have now. – tzs Jan 09 '16 at 20:54
  • will you be doing math on these numbers or are they just stored for historical purposes as timestamps? –  Jan 10 '16 at 15:43
  • @JarrodRoberson the sentence "Our C++ library currently uses time_t for storing time values" might have been misleading; by "storing" I meant "keeping in RAM and registers, for calculations" rather than "keeping in non-volatile memory, for archival". So yes, I am doing math on the time values. – oliver Jan 11 '16 at 10:16

4 Answers

47

which factors influence performance of these operations? Probably the compiler and compiler version; but does the operating system or the CPU make/model influence this as well?

Mostly the processor architecture (and model - please read model where I mention processor architecture in this section). The compiler may have some influence, but most compilers do pretty well on this, so the processor architecture will have a bigger influence than the compiler.

The operating system will have no influence whatsoever (other than "if you change OS, you need to use a different type of compiler which changes what the compiler does" in some cases - but that's probably a small effect).

Will a normal 32-bit system use the 64-bit registers of modern CPUs?

This is not possible. If the system is in 32-bit mode, it will act as a 32-bit system; the extra 32 bits of the registers are completely invisible, just as they would be if the system were actually a "true 32-bit system".

which operations will be especially slow when emulated on 32-bit? Or which will have nearly no slowdown?

Addition and subtraction are worse, as these have to be done as a sequence of two operations, where the second operation requires the first to have completed - this is not the case if the compiler were just producing two add operations on independent data.

Multiplication will get a lot worse if the input parameters are actually 64-bits - so 2^35 * 83 is worse than 2^31 * 2^31, for example. This is due to the fact that the processor can produce a 32 x 32 bit multiply into a 64-bit result pretty well - some 5-10 clock cycles. But a 64 x 64 bit multiply requires a fair bit of extra code, so will take longer.

Division is a similar problem to multiplication - but here it's OK to take a 64-bit input on the one side, divide it by a 32-bit value and get a 32-bit value out. Since it's hard to predict when this will work, the 64-bit divide is probably nearly always slow.

The data will also take twice as much cache space, which may impact the results. And as a similar consequence, general assignment and passing data around will take at least twice as long, since there is twice as much data to operate on.

The compiler will also need to use more registers.

are there any existing benchmark results for using int64_t/uint64_t on 32-bit systems?

Probably, but I'm not aware of any. And even if there are, it would only be somewhat meaningful to you, since the mix of operations is HIGHLY critical to the speed of operations.

If performance is an important part of your application, then benchmark YOUR code (or some representative part of it). It doesn't really matter if Benchmark X gives 5%, 25% or 103% slower results, if your code is some completely different amount slower or faster under the same circumstances.

does anyone have first-hand experience with this performance impact?

I've recompiled some code that uses 64-bit integers for 64-bit architecture, and found the performance improved by a substantial amount - as much as 25% on some bits of code.

Changing your OS to a 64-bit version of the same OS would perhaps help?

Edit:

Because I like to find out what the difference is in these sorts of things, I have written a bit of code, using some primitive templates (still learning that bit - templates aren't exactly my hottest topic, I must say; give me bit-fiddling and pointer arithmetic, and I'll (usually) get it right...).

Here's the code I wrote, trying to replicate a few common functions:

#include <iostream>
#include <cstdint>
#include <ctime>

using namespace std;

static __inline__ uint64_t rdtsc(void)
{
    unsigned hi, lo;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ( (uint64_t)lo)|( ((uint64_t)hi)<<32 );
}

template<typename T>
static T add_numbers(const T *v, const int size)
{
    T sum = 0;
    for(int i = 0; i < size; i++)
        sum += v[i];
    return sum;
}


template<typename T, const int size>
static T add_matrix(const T v[size][size])
{
    T sum[size] = {};
    for(int i = 0; i < size; i++)
    {
        for(int j = 0; j < size; j++)
            sum[i] += v[i][j];
    }
    T tsum=0;
    for(int i = 0; i < size; i++)
        tsum += sum[i];
    return tsum;
}



template<typename T>
static T add_mul_numbers(const T *v, const T mul, const int size)
{
    T sum = 0;
    for(int i = 0; i < size; i++)
        sum += v[i] * mul;
    return sum;
}

template<typename T>
static T add_div_numbers(const T *v, const T mul, const int size)
{
    T sum = 0;
    for(int i = 0; i < size; i++)
        sum += v[i] / mul;
    return sum;
}


template<typename T> 
void fill_array(T *v, const int size)
{
    for(int i = 0; i < size; i++)
        v[i] = i;
}

template<typename T, const int size> 
void fill_array(T v[size][size])
{
    for(int i = 0; i < size; i++)
        for(int j = 0; j < size; j++)
            v[i][j] = i + size * j;
}




uint32_t bench_add_numbers(const uint32_t v[], const int size)
{
    uint32_t res = add_numbers(v, size);
    return res;
}

uint64_t bench_add_numbers(const uint64_t v[], const int size)
{
    uint64_t res = add_numbers(v, size);
    return res;
}

uint32_t bench_add_mul_numbers(const uint32_t v[], const int size)
{
    const uint32_t c = 7;
    uint32_t res = add_mul_numbers(v, c, size);
    return res;
}

uint64_t bench_add_mul_numbers(const uint64_t v[], const int size)
{
    const uint64_t c = 7;
    uint64_t res = add_mul_numbers(v, c, size);
    return res;
}

uint32_t bench_add_div_numbers(const uint32_t v[], const int size)
{
    const uint32_t c = 7;
    uint32_t res = add_div_numbers(v, c, size);
    return res;
}

uint64_t bench_add_div_numbers(const uint64_t v[], const int size)
{
    const uint64_t c = 7;
    uint64_t res = add_div_numbers(v, c, size);
    return res;
}


template<const int size>
uint32_t bench_matrix(const uint32_t v[size][size])
{
    uint32_t res = add_matrix(v);
    return res;
}
template<const int size>
uint64_t bench_matrix(const uint64_t v[size][size])
{
    uint64_t res = add_matrix(v);
    return res;
}


template<typename T>
void runbench(T (*func)(const T *v, const int size), const char *name, T *v, const int size)
{
    fill_array(v, size);

    uint64_t t = rdtsc();
    T res = func(v, size);
    t = rdtsc() - t;
    cout << "result = " << res << endl;
    cout << name << " time in clocks " << dec << t  << endl;
}

template<typename T, const int size>
void runbench2(T (*func)(const T v[size][size]), const char *name, T v[size][size])
{
    fill_array(v);

    uint64_t t = rdtsc();
    T res = func(v);
    t = rdtsc() - t;
    cout << "result = " << res << endl;
    cout << name << " time in clocks " << dec << t  << endl;
}


int main()
{
    // spin up CPU to full speed...
    time_t t = time(NULL);
    while(t == time(NULL)) ;

    const int vsize=10000;

    uint32_t v32[vsize];
    uint64_t v64[vsize];

    uint32_t m32[100][100];
    uint64_t m64[100][100];


    runbench(bench_add_numbers, "Add 32", v32, vsize);
    runbench(bench_add_numbers, "Add 64", v64, vsize);

    runbench(bench_add_mul_numbers, "Add Mul 32", v32, vsize);
    runbench(bench_add_mul_numbers, "Add Mul 64", v64, vsize);

    runbench(bench_add_div_numbers, "Add Div 32", v32, vsize);
    runbench(bench_add_div_numbers, "Add Div 64", v64, vsize);

    runbench2(bench_matrix, "Matrix 32", m32);
    runbench2(bench_matrix, "Matrix 64", m64);
}

Compiled with:

g++ -Wall -m32 -O3 -o 32vs64 32vs64.cpp -std=c++0x

And the results are (note: see the 2016 results below - these numbers are slightly optimistic for 64-bit, since the 64-bit build used SSE instructions while the 32-bit build used none):

result = 49995000
Add 32 time in clocks 20784
result = 49995000
Add 64 time in clocks 30358
result = 349965000
Add Mul 32 time in clocks 30182
result = 349965000
Add Mul 64 time in clocks 79081
result = 7137858
Add Div 32 time in clocks 60167
result = 7137858
Add Div 64 time in clocks 457116
result = 49995000
Matrix 32 time in clocks 22831
result = 49995000
Matrix 64 time in clocks 23823

As you can see, addition and multiplication aren't that much worse. Division gets really bad. Interestingly, the matrix addition shows hardly any difference at all.

And is it faster on 64-bit, I hear some of you ask? Using the same compiler options, just -m64 instead of -m32 - yup, a lot faster:

result = 49995000
Add 32 time in clocks 8366
result = 49995000
Add 64 time in clocks 16188
result = 349965000
Add Mul 32 time in clocks 15943
result = 349965000
Add Mul 64 time in clocks 35828
result = 7137858
Add Div 32 time in clocks 50176
result = 7137858
Add Div 64 time in clocks 50472
result = 49995000
Matrix 32 time in clocks 12294
result = 49995000
Matrix 64 time in clocks 14733

Edit, update for 2016: four variants, with and without SSE, in 32- and 64-bit mode of the compiler.

I'm typically using clang++ as my usual compiler these days. I tried compiling with g++ (but it would still be a different version than above, as I've updated my machine - and I have a different CPU too). Since g++ failed to compile the no-sse version in 64-bit, I didn't see the point in that. (g++ gives similar results anyway)

As a short table:

Test name      | no-sse 32 | no-sse 64 | sse 32 | sse 64 |
----------------------------------------------------------
Add uint32_t   |   20837   |   10221   |   3701 |   3017 |
----------------------------------------------------------
Add uint64_t   |   18633   |   11270   |   9328 |   9180 |
----------------------------------------------------------
Add Mul 32     |   26785   |   18342   |  11510 |  11562 |
----------------------------------------------------------
Add Mul 64     |   44701   |   17693   |  29213 |  16159 |
----------------------------------------------------------
Add Div 32     |   44570   |   47695   |  17713 |  17523 |
----------------------------------------------------------
Add Div 64     |  405258   |   52875   | 405150 |  47043 |
----------------------------------------------------------
Matrix 32      |   41470   |   15811   |  21542 |   8622 |
----------------------------------------------------------
Matrix 64      |   22184   |   15168   |  13757 |  12448 |

Full results with compile options.

$ clang++ -m32 -mno-sse 32vs64.cpp --std=c++11 -O2
$ ./a.out
result = 49995000
Add 32 time in clocks 20837
result = 49995000
Add 64 time in clocks 18633
result = 349965000
Add Mul 32 time in clocks 26785
result = 349965000
Add Mul 64 time in clocks 44701
result = 7137858
Add Div 32 time in clocks 44570
result = 7137858
Add Div 64 time in clocks 405258
result = 49995000
Matrix 32 time in clocks 41470
result = 49995000
Matrix 64 time in clocks 22184

$ clang++ -m32 -msse 32vs64.cpp --std=c++11 -O2
$ ./a.out
result = 49995000
Add 32 time in clocks 3701
result = 49995000
Add 64 time in clocks 9328
result = 349965000
Add Mul 32 time in clocks 11510
result = 349965000
Add Mul 64 time in clocks 29213
result = 7137858
Add Div 32 time in clocks 17713
result = 7137858
Add Div 64 time in clocks 405150
result = 49995000
Matrix 32 time in clocks 21542
result = 49995000
Matrix 64 time in clocks 13757


$ clang++ -m64 -msse 32vs64.cpp --std=c++11 -O2
$ ./a.out
result = 49995000
Add 32 time in clocks 3017
result = 49995000
Add 64 time in clocks 9180
result = 349965000
Add Mul 32 time in clocks 11562
result = 349965000
Add Mul 64 time in clocks 16159
result = 7137858
Add Div 32 time in clocks 17523
result = 7137858
Add Div 64 time in clocks 47043
result = 49995000
Matrix 32 time in clocks 8622
result = 49995000
Matrix 64 time in clocks 12448


$ clang++ -m64 -mno-sse 32vs64.cpp --std=c++11 -O2
$ ./a.out
result = 49995000
Add 32 time in clocks 10221
result = 49995000
Add 64 time in clocks 11270
result = 349965000
Add Mul 32 time in clocks 18342
result = 349965000
Add Mul 64 time in clocks 17693
result = 7137858
Add Div 32 time in clocks 47695
result = 7137858
Add Div 64 time in clocks 52875
result = 49995000
Matrix 32 time in clocks 15811
result = 49995000
Matrix 64 time in clocks 15168
Mats Petersson
  • So if the CPU running in 32 bit mode affects the answer, doesn't the OS being 32 bit also matter because it mandates 32 bit mode? I don't know too much about this topic, but AFAIK a 32 bit OS generally won't support running *anything* in 64 bit mode. –  May 30 '13 at 17:04
  • You CAN use a mixed 32/64 mode, as the Linux x32 ABI does… – dom0 May 30 '13 at 20:49
  • Tell me, what bits are set in the code segment selector for `x32`? More specifically, what is the value of bit 53? It is set! In other words, x32 isn't REALLY 32-bit mode. It uses 64-bit registers and 64-bit mode, but 32-bit pointers [sign-extended to 64 bits] and only the first and last 2GB of virtual address space. – Mats Petersson May 30 '13 at 20:56
  • @delnan: I have now added a small home-built benchmark, showing the performance of 32 and 64-bit integer calculations in with a 32-bit and a 64-bit build of the code. – Mats Petersson May 30 '13 at 22:39
  • Mats, thanks for the great benchmark! This at least gives some basic range of performance impact for 32bit systems. Completely switching to native 64bit isn't quite possible at the moment, as we have to support existing systems as well. – oliver May 31 '13 at 09:06
  • 99.9% sure that "existing systems" will run just fine in 64-bit. – Mats Petersson May 31 '13 at 09:18
  • It strikes me as slightly suspicious that the performance of 32-bit code is so much faster in 64-bit mode. I might imagine that your (trivial) loops are being vectorized - but only in x64, since only x64 *by default* supports vectorization. Vectorized performance certainly deserves its own analysis. Ideally, you'd want your benchmark to avoid vectorizability initially, and you also want to be less sensitive to loop unrolling (you're benchmarking addition, so an extra increment matters). – Eamon Nerbonne Jan 09 '16 at 23:48
  • @EamonNerbonne: I will re-benchmark. Got new compiler versions and a different processor, so will run a set of four with the SSE/vectorisation on/off. – Mats Petersson Jan 09 '16 at 23:55
  • @EamonNerbonne: You are/were correct, the gcc -m64 also enables SSE by default, which skews the results. – Mats Petersson Jan 10 '16 at 00:53
  • All 64-bit CPUs support SSE, which is why gcc defaults to it. Use `-mno-sse` if you want to benchmark 64-bit code without SSE, but be aware that such a benchmark is meaningless, as there is no reason to disable SSE for 64-bit code in practice. – Konrad Borowski Jan 10 '16 at 14:30
  • @xfix: The purpose here is to show the difference between 32- and 64-bit processor's ability to perform 64-bit operations. If, like my benchmark about 2 and a half years back, you compare 32-bit no-sse with 64-bit sse, it skews the result and we are comparing apples with bananas. Thus I corrected it by showing both with and without sse for both 32- and 64-bit mode. This allows a fair comparison for both cases. – Mats Petersson Jan 10 '16 at 15:55
  • Yeah, I'd say disabling auto-vectorization is sensible. Maybe just use `-fno-tree-vectorize`. In a "real program", it might not be worth the overhead to get numbers into xmm registers just for a couple add / subtracts. Although in 32bit mode, 64bit integer function args arrive on the stack, so the compiler could load them into xmm registers with `movq`, and use `paddq`/`psubq`. However, it's not as efficient to compare integers in xmm registers. So even 32bit that can assume SSE2 isn't going to benefit in "real code", only in a synthetic benchmark like this. The no-SSE results are useful – Peter Cordes Feb 16 '16 at 03:32
  • `-msse2` is the option you should use, not just `-msse`. SSE2 is baseline for AMD64. I'm not 100% sure, but I think `-msse` only enables original-SSE (SSE1, but nobody calls it that). Anyway, SSE1 only supports single-precision float in xmm registers, and some extra integer instructions for MMX registers. SSE2 added vector-integer and double-precision support. (SSSE3 and SSE4.1 also added lots of good stuff. The byte shuffle in SSSE3 isn't usually used in autovectorization, but the much better integer math in SSE4.1 is: 32bit multiply, more sizes of packed min/max, etc.) – Peter Cordes Feb 16 '16 at 03:41
  • `-msse` or `-msse2` gives the same result for this particular code, but is much better than `-mno-sse` or `-mno-sse2`. `-march=native` also gives similar results. The whole point was to make a reasonable comparison between the same things. Which I think it makes (I did look at the code generated, and it's pretty much the same between 32- and 64-bit in this case). – Mats Petersson Feb 16 '16 at 07:22
9

More than you ever wanted to know about doing 64-bit math in 32-bit mode...

When you use 64-bit numbers in 32-bit mode (even on a 64-bit CPU, if the code is compiled for 32-bit), they are stored as two separate 32-bit numbers, one storing the higher bits of the number, and another storing the lower bits. The impact of this depends on the instruction. (tl;dr - generally, doing 64-bit math on a 32-bit CPU is in theory 2 times slower, as long as you don't divide/modulo, however in practice the difference is going to be smaller (1.3x would be my guess), because usually programs don't just do math on 64-bit integers, and also because of pipelining, the difference may be much smaller in your program).

Addition/subtraction

Many architectures support a so-called carry flag. It's set when the result of an addition overflows; for subtraction, x86 sets it when a borrow is needed (some other architectures use the opposite convention). The behaviour of this bit can be shown with long addition and long subtraction. C in this example shows either a bit higher than the highest representable bit (during the operation), or the carry flag (after the operation).

  C 7 6 5 4 3 2 1 0      C 7 6 5 4 3 2 1 0
  0 1 1 1 1 1 1 1 1      1 0 0 0 0 0 0 0 0
+   0 0 0 0 0 0 0 1    -   0 0 0 0 0 0 0 1
= 1 0 0 0 0 0 0 0 0    = 0 1 1 1 1 1 1 1 1

Why is carry flag relevant? Well, it just so happens that CPUs usually have two separate addition and subtraction operations. In x86, the addition operations are called add and adc. add stands for addition, while adc for addition with carry. The difference between those is that adc considers a carry bit, and if it is set, it adds one to the result.

Similarly, subtraction with borrow (sbb on x86) subtracts an extra 1 from the result if the carry (borrow) bit is set.

This behaviour makes it easy to implement arbitrary-size addition and subtraction on integers. The result of adding x and y (assuming those are 8-bit) is never bigger than 0x1FE. If you also add an incoming carry of 1, you get at most 0x1FF. Nine bits are therefore enough to represent the result of any 8-bit addition. If you start the addition with add, and then add any bits beyond the initial ones with adc, you can do addition on any size of data you like.

Adding two 64-bit values on a 32-bit CPU goes as follows:

  1. Add the lower 32 bits of b to the lower 32 bits of a.
  2. Add with carry the upper 32 bits of b to the upper 32 bits of a.

Subtraction works analogously.

This gives 2 instructions; however, because of instruction pipelining, it may be slower than that, as one calculation depends on the other one to finish, so if the CPU doesn't have anything else to do besides the 64-bit addition, it may have to wait for the first addition to be done.
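
Written out as a rough C++ sketch (this is just an illustration of the two steps above with explicit 32-bit halves - the compiler does the equivalent for you with add/adc when you use a plain uint64_t):

#include <stdint.h>

struct u64_halves { uint32_t lo, hi; };

static u64_halves add64(u64_halves a, u64_halves b)
{
    u64_halves r;
    r.lo = a.lo + b.lo;              // "add"  - lower halves
    uint32_t carry = (r.lo < a.lo);  // did the low addition wrap around?
    r.hi = a.hi + b.hi + carry;      // "adc"  - upper halves plus carry
    return r;
}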

Multiplication

It so happens that on x86, imul and mul can be used in such a way that the upper half of the result is stored in the edx register. Therefore, multiplying two 32-bit values to get a 64-bit value is really easy. Such a multiplication is one instruction, but to make use of it, one of the multiplied values must be stored in eax.

Anyway, for the more general case of multiplying two 64-bit values, the result can be calculated as follows.

First of all, it's easy to notice that the lower 32 bits of the result depend only on the lower 32 bits of the multiplied variables. This is due to the congruence relation:

a1 ≡ b1 (mod n)
a2 ≡ b2 (mod n)
a1a2 ≡ b1b2 (mod n)

Therefore, the task is limited to just determining the higher 32 bits. To calculate the higher 32 bits of the result, the following values should be added together:

  • Higher 32 bits of the product of the two lower 32-bit halves (the overflow, which the CPU stores in edx)
  • Higher 32 bits of the first variable multiplied with the lower 32 bits of the second variable
  • Lower 32 bits of the first variable multiplied with the higher 32 bits of the second variable

This gives about 5 instructions; however, because of the relatively limited number of registers in x86 (ignoring extensions to the architecture), they cannot take too much advantage of pipelining. Enable SSE if you want to improve the speed of multiplication, as this increases the number of available registers.
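
For illustration, the decomposition described above written out in C++ (a sketch only - this is the maths, not the exact instruction sequence a compiler emits; mul64 is a made-up name):

#include <stdint.h>

static uint64_t mul64(uint64_t a, uint64_t b)
{
    uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);

    uint64_t low  = (uint64_t)a_lo * b_lo;   // 32 x 32 -> 64, one mul instruction
    uint32_t high = (uint32_t)(low >> 32)    // its upper half (what lands in edx)
                  + a_hi * b_lo              // cross terms, truncated to 32 bits -
                  + a_lo * b_hi;             // anything above bit 63 is discarded
    return ((uint64_t)high << 32) | (uint32_t)low;
}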

Division/Modulo (both are similar in implementation)

I don't know how it works, but it's much more complex than addition, subtraction or even multiplication. It's likely to be ten times slower than division on a 64-bit CPU, however. Check "The Art of Computer Programming, Volume 2: Seminumerical Algorithms", page 257 for more details if you can understand it (I cannot, in a way that I could explain it, unfortunately).

If you divide by a power of 2, please refer to the shifting section, because that's essentially what the compiler can optimize division to (plus adding the most significant bit before shifting, for signed numbers).
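
For instance (a trivial illustration; the function name is made up):

#include <stdint.h>

static uint64_t div_by_8(uint64_t x)
{
    return x / 8;   // an unsigned divide by a power of two compiles to a plain shift (x >> 3)
}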

Or/And/Xor

Considering that those are per-bit operations, nothing special happens here; the bitwise operation is simply done twice (once for each 32-bit half).

Shifting left/right

Interestingly, x86 actually has an instruction to perform a 64-bit left shift called shld, which, instead of filling the least significant bits of the value with zeros, fills them with the most significant bits of a different register. Similarly, right shifts are handled by the shrd instruction. This makes 64-bit shifting a two-instruction operation.

However, that's only the case for constant shifts. When a shift count is not constant, things get trickier, as the x86 architecture only supports shift counts of 0-31. Anything beyond that is, according to the official documentation, undefined, and in practice a bitwise AND with 0x1F is performed on the count. Therefore, when a shift count is higher than 31, one of the two halves is erased entirely (for a left shift that's the lower half, for a right shift the higher half). The other half receives the value from the half that was erased, and then the shift is performed. The result therefore depends on the branch predictor making good predictions, and is a bit slower because the count needs to be checked.
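
A rough sketch of a variable 64-bit left shift done with 32-bit halves (illustration only; the n >= 32 branch is the check described above, and shift counts of 64 or more are left undefined, just as in C++):

#include <stdint.h>

struct u64_halves { uint32_t lo, hi; };

static u64_halves shl64(u64_halves v, unsigned n)   // assumes 0 <= n < 64
{
    u64_halves r;
    if (n >= 32) {                                   // lower half is erased entirely
        r.hi = v.lo << (n - 32);
        r.lo = 0;
    } else if (n == 0) {
        r = v;
    } else {
        r.hi = (v.hi << n) | (v.lo >> (32 - n));     // shld-style: fill with bits from lo
        r.lo = v.lo << n;
    }
    return r;
}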

__builtin_popcount[ll]

__builtin_popcount(lower) + __builtin_popcount(higher)
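
Wrapped into a complete function, that might look like this (just a sketch; popcount64 is a made-up name):

#include <stdint.h>

static int popcount64(uint64_t x)
{
    return __builtin_popcount((uint32_t)x)           // lower 32 bits
         + __builtin_popcount((uint32_t)(x >> 32));  // higher 32 bits
}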

Other builtins

I'm too lazy to finish the answer at this point. Does anyone even use those?

Unsigned vs signed

Addition, subtraction, multiplication, or, and, xor, and shift left generate the exact same code. Shift right uses only slightly different code (arithmetic shift vs logical shift), but structurally it's the same. It's likely that division does generate different code, however, and signed division is likely to be slower than unsigned division.

Benchmarks

Benchmarks? They are mostly meaningless, as instruction pipelining will usually lead to things being faster when you don't constantly repeat the same operation. Feel free to consider division slow, but nothing else really is, and when you get outside of benchmarks, you may notice that because of pipelining, doing 64-bit operations on 32-bit CPU is not slow at all.

Benchmark your own application, don't trust micro-benchmarks that don't do what your application does. Modern CPUs are quite tricky, so unrelated benchmarks can and will lie.

Konrad Borowski
2

Your question sounds pretty weird in its environment. You use time_t, which takes up 32 bits. You need additional precision, which means more bits. So you are forced to use something bigger than int32. It doesn't matter what the performance is, right? Choices range between using just, say, 40 bits, or going straight to int64. Unless millions of instances must be stored, the latter is a sensible choice.

As others pointed out, the only way to know the true performance is to measure it with a profiler (for some gross samples a simple clock will do). So just go ahead and measure. It can't be hard to global-replace your time_t usage with a typedef, redefine it to 64 bit, and patch up the few instances where a real time_t was expected.

My bet would be on "unmeasurable difference" unless your current time_t instances take up at least a few megs of memory. On current Intel-like platforms the cores spend most of the time waiting for external memory to get into cache. A single cache miss stalls for hundred(s) of cycles, which makes calculating 1-tick differences on instructions infeasible. Your real performance may drop due to things like your current structure just fitting a cache line while the bigger one needs two. And if you never measured your current performance, you might discover that you could gain an extreme speedup of some functions just by adding some alignment or exchanging the order of some members in a structure. Or pack(1) the structure instead of using the default layout...
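
A minimal sketch of that typedef approach (timevalue_t is a made-up name; flip the one line to compare builds):

#include <stdint.h>

typedef int64_t timevalue_t;      // was: typedef time_t timevalue_t;

timevalue_t elapsed(timevalue_t start, timevalue_t end)
{
    return end - start;           // all time arithmetic goes through timevalue_t
}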

Balog Pal
  • Well I don't need the additional precision in all places - some algorithms can run fine with time_t precision. The question is whether I should use two different time types in my code (as performance improvement), or can get away with always using int64_t even in places where the additional precision isn't needed. But yes, I will set up some benchmarks with real-world scenarios to see whether this really matters. – oliver May 31 '13 at 09:12
0

Addition/subtraction basically become two cycles each; multiplication and division depend on the actual CPU. The general performance impact will be rather low.

Note that Intel Core 2 supports EM64T.

dom0
  • is Intel Core 2 a 32 bit processor? No, it's a 64 bit processor. – Dan May 30 '13 at 16:49
  • @Dan But the system that runs on it may be 32 bit. Then the program won't use 64 bit instructions either AFAIK, because the OS doesn't support 64 bit and because the compiler has to assume a 32 bit ABI and instruction set. –  May 30 '13 at 17:02