
Today, I noticed that the speed of several simple bitwise and arithmetic operations differs significantly between int, unsigned, long long and unsigned long long on my 64-bit PC.

In particular, the following loop is about twice as fast for unsigned as for long long, which I didn't expect.

int k = 15;
int N = 30;
int mask = (1 << k) - 1;              // start with the lowest k bits set
while (!(mask & 1 << N)) {
    int lo = mask & ~(mask - 1);      // lowest set bit of mask
    int lz = (mask + lo) & ~mask;     // lowest zero bit above that run of ones
    mask |= lz;                       // set that bit...
    mask &= ~(lz - 1);                // ...and clear everything below it
    mask |= (lz / lo / 2) - 1;        // refill the remaining ones at the bottom
}

(full code here)
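
Condensed, the benchmark is roughly the following (the template parameter T, the std::chrono timing and the volatile sink are just how I'd sketch it here; the full code may differ in details):

#include <chrono>
#include <cstdio>

// Rough sketch of the benchmark: run the loop above for one integer type T
// and report the elapsed wall-clock time in seconds.
template <typename T>
double time_loop() {
    T k = 15;
    T N = 30;
    T mask = (T(1) << k) - 1;
    auto start = std::chrono::steady_clock::now();
    while (!(mask & (T(1) << N))) {
        T lo = mask & ~(mask - 1);
        T lz = (mask + lo) & ~mask;
        mask |= lz;
        mask &= ~(lz - 1);
        mask |= (lz / lo / 2) - 1;
    }
    std::chrono::duration<double> dt = std::chrono::steady_clock::now() - start;
    volatile T sink = mask;   // keep the result live so the loop isn't optimized out
    (void)sink;
    return dt.count();
}

int main() {
    std::printf("%.9f (int)\n",                time_loop<int>());
    std::printf("%.9f (long long)\n",          time_loop<long long>());
    std::printf("%.9f (unsigned)\n",           time_loop<unsigned>());
    std::printf("%.9f (unsigned long long)\n", time_loop<unsigned long long>());
}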

Here are the timings in seconds (for g++ with -O, -O2 and -O3):

1.834207723 (int)
3.054731598 (long long)
1.584846237 (unsigned)
2.201142018 (unsigned long long)

These times are very consistent (within a 1% margin). Without the -O flag, each one is about one second slower, but the relative speeds are the same.

Is there a clear reason for this? Vectorization might explain it for the 32-bit types, but I can't see where the huge difference between long long and unsigned long long comes from. Are some operations just much slower on some types than on others, or is it just a general thing that 64-bit types are slower (even on 64-bit architectures)?

For those interested, this loop iterates over all subsets of {1, 2, ..., 30} with exactly 15 elements. It does so by going through, in increasing order, all integers less than 1 << 30 with exactly 15 bits set. For the current case, that's 155117520 iterations. I don't know the source of this snippet anymore, but this post explains some more.
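
As a quick check of that count, C(30, 15) can be computed directly; here is a small standalone snippet (not part of the benchmark itself):

#include <cstdio>

int main() {
    // C(30, 15): the number of 15-element subsets of a 30-element set.
    long long binom = 1;
    for (int i = 1; i <= 15; ++i)
        binom = binom * (15 + i) / i;   // each step is C(15 + i, i), so it stays integral
    std::printf("C(30,15) = %lld\n", binom);   // prints 155117520
}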

Edit

It seems from the assembly code that the division can be made faster when the type is unsigned. I think that makes sense, because we don't have to account for the sign bit.
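
As a tiny illustration of the sign-bit point (the two helper functions below are mine, purely for demonstration): a signed division by 2 has to round toward zero, so the compiler emits an extra fix-up for possibly negative inputs, while an unsigned division by 2 is a single logical shift.

// Demonstration only: how a compiler can lower a division by 2.
int divide_by_two_signed(int x) {
    return x / 2;        // typically a shift plus a sign fix-up (rounds toward zero)
}

unsigned divide_by_two_unsigned(unsigned x) {
    return x / 2;        // typically a single logical shift right
}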

Furthermore, the 32-bit operations use movl and other xxxl instructions, while the 64-bit operations use movq and xxxq.

Edit 2

After reading the post I linked, I decided to use the formula given over there:

T k = 15;
T N = 30;
T mask = (1 << k) - 1;
while (!(mask & 1 << N)) {
    T t = mask | (mask - 1);   // mask with all bits below its lowest set bit also set
    // __builtin_ctz takes an unsigned int, which is fine even for the 64-bit
    // types here, since mask's set bits always stay within the low 31 bits
    mask = (t + 1) | (((~t & -~t) - 1) >> (__builtin_ctz(mask) + 1));
}

This runs in about a third of the time of the code posted above, and takes the same time for all four types.

  • Have you looked at the generated assembly? – Nate Eldredge Nov 21 '15 at 20:44
  • Well, assembly is not something I really excel in, but it might be worth a try – Ragnar Nov 21 '15 at 20:46
  • Double-check your binary is x64? – moltenform Nov 21 '15 at 20:48
  • @jmnben `file ./a.out` starts with `ELF 64-bit` and `x86-64`, so that seems right. – Ragnar Nov 21 '15 at 20:54
  • Intel Compiler is smart enough to optimize the calculations at compile time - the execution time is 0 seconds. – Andrey Nasonov Nov 21 '15 at 20:55
  • I am probably too lazy to check it myself (since I don't know what exactly you are doing inside the loop) but, are you sure that the loops have the same *number of calculations* in every case? Moreover, you cannot draw such a conclusion with a test that runs only once, you'd probably have to embed each case inside a wider loop, to have more serious results. BTW it is hard to believe that your simple test case takes many seconds :o – A.S.H Nov 21 '15 at 21:07
  • @A.S.H I'll edit the post to explain what the loop actually does. – Ragnar Nov 21 '15 at 21:14
  • @A.S.H, the number of calculations is the same. The variable mask does not take values that cannot be fit in 0 .. 2^31-1 range. – Andrey Nasonov Nov 21 '15 at 21:20
  • see [also](http://stackoverflow.com/a/5069643/3288910). – Jason Nov 21 '15 at 21:44

1 Answer


The slowest operation in your code is

mask |= (lz / lo / 2) - 1;

32-bit division is significantly faster than 64-bit division. For example, on Ivy Bridge, 32-bit IDIV has a latency of 19-26 clocks, while 64-bit IDIV has a latency of 28-103 clocks [1].

The unsigned version is also faster than the signed one because division by 2 is a simple bit shift in the unsigned case, and there are no sign-extension instructions (CDQ, CQO) before the division.
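
Because lo and lz are always single set bits (powers of two) in this loop, the division could also be avoided altogether; for example (a sketch, not necessarily the rewrite you ended up with):

// Division-free variant of the loop body: lo and lz are single set bits,
// so lz / lo / 2 == lz >> (__builtin_ctz(lo) + 1).
int k = 15;
int N = 30;
int mask = (1 << k) - 1;
while (!(mask & 1 << N)) {
    int lo = mask & ~(mask - 1);
    int lz = (mask + lo) & ~mask;
    mask |= lz;
    mask &= ~(lz - 1);
    mask |= (lz >> (__builtin_ctz(lo) + 1)) - 1;   // was: (lz / lo / 2) - 1
}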

[1] http://www.agner.org/optimize/instruction_tables.pdf

  • You probably mean "...is significantly faster..." – Sami Kuhmonen Nov 21 '15 at 21:23
  • After using code without a division, all four types use the same speed. Also, the assembly was pretty much the same apart from the division instruction, so I guess you're right. – Ragnar Nov 21 '15 at 21:39