38

In C, why is signed int faster than unsigned int? True, I know that this has been asked and answered multiple times on this website (links below). However, most people said that there is no difference. I have written code and accidentally found a significant performance difference.

Why would the "unsigned" version of my code be slower than the "signed" version (even when testing the same number)? (I have a x86-64 Intel processor).

Similar links

Compile Command: gcc -Wall -Wextra -pedantic -O3 -Wl,-O3 -g0 -ggdb0 -s -fwhole-program -funroll-loops -pthread -pipe -ffunction-sections -fdata-sections -std=c11 -o ./test ./test.c && strip --strip-all --strip-unneeded --remove-section=.note --remove-section=.comment ./test


signed int version

NOTE: There is no difference if I explicitly declare signed int on all numbers.

int isprime(int num) {
    // Test if a signed int is prime
    int i;
    if (num % 2 == 0 || num % 3 == 0)
        return 0;
    else if (num % 5 == 0 || num % 7 == 0)
        return 0;
    else {
        for (i = 11; i < num; i += 2) {
            if (num % i == 0) {
                if (i != num)
                    return 0;
                else
                    return 1;
            }
        }
    }
    return 1;
}

unsigned int version

int isunsignedprime(unsigned int num) {
    // Test if an unsigned int is prime
    unsigned int i;
    if (num % (unsigned int)2 == (unsigned int)0 || num % (unsigned int)3 == (unsigned int)0)
        return 0;
    else if (num % (unsigned int)5 == (unsigned int)0 || num % (unsigned int)7 == (unsigned int)0)
        return 0;
    else {
        for (i = (unsigned int)11; i < num; i += (unsigned int)2) {
            if (num % i == (unsigned int)0) {
                if (i != num)
                    return 0;
                else
                    return 1;
            }
        }
    }
    return 1;
}

Test this in a file with the below code:

int main(void) {
    printf("%d\n", isprime(294967291));
    printf("%d\n", isprime(294367293));
    printf("%d\n", isprime(294967293));
    printf("%d\n", isprime(294967241)); // slow
    printf("%d\n", isprime(294967251));
    printf("%d\n", isprime(294965291));
    printf("%d\n", isprime(294966291));
    printf("%d\n", isprime(294963293));
    printf("%d\n", isprime(294927293));
    printf("%d\n", isprime(294961293));
    printf("%d\n", isprime(294917293));
    printf("%d\n", isprime(294167293));
    printf("%d\n", isprime(294267293));
    printf("%d\n", isprime(294367293)); // slow
    printf("%d\n", isprime(294467293));
    return 0;
}

Results (time ./test):

Signed - real 0m0.949s
Unsigned - real 0m1.174s
chqrlie
  • 131,814
  • 10
  • 121
  • 189
Devyn Collier Johnson
  • 4,124
  • 3
  • 18
  • 39
  • 4
    It may just be due to the overhead of all of the explicit casting. – Cᴏʀʏ Dec 08 '15 at 20:13
  • 1
    did you tried to explicit cast to `signed` also ? – Michi Dec 08 '15 at 20:15
  • Have a look http://stackoverflow.com/questions/4712315/performance-of-unsigned-vs-signed-integers – ashiquzzaman33 Dec 08 '15 at 20:16
  • This is probably processor specific, no? – Fiddling Bits Dec 08 '15 at 20:17
  • Good points, everyone. Let me try explicitly casting "signed". – Devyn Collier Johnson Dec 08 '15 at 20:19
  • 14
    The first thing I'd do is compare the generated assembly code for both cases, and see if any additional instructions are being emitted in the unsigned case. – Tom Karzes Dec 08 '15 at 20:19
  • 1
    Also, are you compiling with -O enabled? – Tom Karzes Dec 08 '15 at 20:20
  • 15
    When writing benchmarks I strongly suggest to increase the measurement time to at least 30s and to run the task executing the benchmark with highest priority possible. Otherwise your measured 20% difference may be cause by a great deal by the OS. – Lukas Thomsen Dec 08 '15 at 20:22
  • I updated the question with additional information. Explicitly declaring "signed int" made no difference. @TomKarzes , my compiler flags were posted prior to any edits. – Devyn Collier Johnson Dec 08 '15 at 20:24
  • 10
    @Cᴏʀʏ they're free casts anyway. They tell the compiler to change the type, but cost no code to implement. – harold Dec 08 '15 at 20:25
  • 1
    Aside from the systematic error pointed out by @LukasThomsen, there is no evidence of a statistically significant difference. –  Dec 08 '15 at 20:29
  • 4
    Both assembly code only differ by (un)signed arithmetic (`idivl` vs `divl`) on my machine, but execution time is roughly the same OP showed. This difference is then due to these instructions... Is there any border effect due to instruction parallelism, out-of-order execution, etc? – Jean-Baptiste Yunès Dec 08 '15 at 20:32
  • What's the int size? If it's 32 bits, there could be some overhead for sign-extending 32-bit constants to 64-bit registers. But taking a look at the generated assembly would reveal that. Take a debugger and see. – Seva Alekseyev Dec 08 '15 at 20:34
  • @LukasThomsen and Rhymoid , I tested the code with more numbers (all unique), and "unsigned int" is still slower than "signed int". I added enough numbers to see a 50% difference. You had a good idea, though. – Devyn Collier Johnson Dec 08 '15 at 20:36
  • For example on Haswell, 32bit `div` (unsigned) takes one more µop (10 vs 9) and has a slightly worse (on average) throughput compared to 32bit `idiv` (signed), but that seems like too small a difference to explain this – harold Dec 08 '15 at 20:41
  • 1
    No this can explain it, the observed difference is not so huge and as I said the code only differ (for me) on signedness. What is strange is that when I changed int to long then results are the converse! It is much longer for signed than for unsigned (almost the same proportion than the original test). – Jean-Baptiste Yunès Dec 08 '15 at 20:44
  • @Jean-BaptisteYunès usually the 64bit `idiv` is a lot worse than the 64bit `div`. But the difference in time here is over 20% right, while the difference in throughput between 32bit `idiv` and `div` is nowhere near that – harold Dec 08 '15 at 20:50
  • 7
    `(unsigned int)3` can be more clearly written `3u` – M.M Dec 08 '15 at 21:03
  • 1
    To solve this you should `diff` the generated assembly code. – M.M Dec 08 '15 at 21:03
  • 1
    By the way, there is no cost in writing `2U` which is unsigned literal, making `(unsigned)` cast unnecessary (and more readable code). – Luis Colorado Dec 11 '15 at 06:32
  • `(signed)2` is exactly the same as `2` because integer literals must be int unless it doesn't fit in an `int`. Same to `(unsigned)2` and `2U` – phuclv Nov 26 '16 at 16:59
  • 1
    Your code is **wrong** because it'll return false for 2, 3, 5, 7. And it's extremely inefficient because it checks for all odd divisors up to `num` whereas this at least could be optimized down to [only factors in the form `6k±1` of `num` up until `sqrt(num)`](http://stackoverflow.com/a/26760082/995714) which significantly reduces the number of loops needed – phuclv Nov 26 '16 at 17:09
  • A bit of pedantry: if you write `int x`, you are not *explicitly* declaring a signed int. You are declaring it *implicitly*. To be explicit, you would write `signed int x`. To explicitly declare `x` signed, and implicitly declare it an integer, you would write `signed x`. – William Pursell Aug 31 '17 at 13:23
  • 1
    Any performance difference is due to the compiler or test method and not due to `C` _In C, why is “signed int” faster than “unsigned int”?_ is not the true question here. – chux - Reinstate Monica Aug 31 '17 at 13:37
  • Have you tried running your test with the -fwrapv option? – shadow_map Sep 01 '17 at 07:48

5 Answers5

36

Your question is genuinely intriguing as the unsigned version consistently produces code that is 10 to 20% slower. Yet there are multiple problems in the code:

  • Both functions return 0 for 2, 3, 5 and 7, which is incorrect.
  • The test if (i != num) return 0; else return 1; is completely useless as the loop body is only run for i < num. Such a test would be useful for the small prime tests but special casing them is not really useful.
  • the casts in the unsigned version are redundant.
  • benchmarking code that produces textual output to the terminal is unreliable, you should use the clock() function to time CPU intensive functions without any intervening I/O.
  • the algorithm for prime testing is utterly inefficient as the loop runs num / 2 times instead of sqrt(num).

Let's simplify the code and run some precise benchmarks:

#include <stdio.h>
#include <time.h>

int isprime_slow(int num) {
    if (num % 2 == 0)
        return num == 2;
    for (int i = 3; i < num; i += 2) {
        if (num % i == 0)
            return 0;
    }
    return 1;
}

int unsigned_isprime_slow(unsigned int num) {
    if (num % 2 == 0)
        return num == 2;
    for (unsigned int i = 3; i < num; i += 2) {
        if (num % i == 0)
            return 0;
    }
    return 1;
}

int isprime_fast(int num) {
    if (num % 2 == 0)
        return num == 2;
    for (int i = 3; i * i <= num; i += 2) {
        if (num % i == 0)
            return 0;
    }
    return 1;
}

int unsigned_isprime_fast(unsigned int num) {
    if (num % 2 == 0)
        return num == 2;
    for (unsigned int i = 3; i * i <= num; i += 2) {
        if (num % i == 0)
            return 0;
    }
    return 1;
}

int main(void) {
    int a[] = {
        294967291, 0, 294367293, 0, 294967293, 0, 294967241, 1, 294967251, 0,
        294965291, 0, 294966291, 0, 294963293, 0, 294927293, 1, 294961293, 0,
        294917293, 0, 294167293, 0, 294267293, 0, 294367293, 0, 294467293, 0,
    };
    struct testcase { int (*fun)(); const char *name; int t; } test[] = {
        { isprime_slow, "isprime_slow", 0 },
        { unsigned_isprime_slow, "unsigned_isprime_slow", 0 },
        { isprime_fast, "isprime_fast", 0 },
        { unsigned_isprime_fast, "unsigned_isprime_fast", 0 },
    };

    for (int n = 0; n < 4; n++) {
        clock_t t = clock();
        for (int i = 0; i < 30; i += 2) {
            if (test[n].fun(a[i]) != a[i + 1]) {
                printf("%s(%d) != %d\n", test[n].name, a[i], a[i + 1]);
            }
        }
        test[n].t = clock() - t;
    }
    for (int n = 0; n < 4; n++) {
        printf("%21s: %4d.%03dms\n", test[n].name, test[n].t / 1000), test[n].t % 1000);
    }
    return 0;
}

The code compiled with clang -O2 on OS/X produces this output:

         isprime_slow:  788.004ms
unsigned_isprime_slow:  965.381ms
         isprime_fast:    0.065ms
unsigned_isprime_fast:    0.089ms

These timings are consistent with the OP's observed behavior on a different system, but show the dramatic improvement caused by the more efficient iteration test: 10000 times faster!

Regarding the question Why is the function slower with unsigned?, let's look at the generated code (gcc 7.2 -O2):

isprime_slow(int):
        ...
.L5:
        movl    %edi, %eax
        cltd
        idivl   %ecx
        testl   %edx, %edx
        je      .L1
.L4:
        addl    $2, %ecx
        cmpl    %esi, %ecx
        jne     .L5
.L6:
        movl    $1, %edx
.L1:
        movl    %edx, %eax
        ret

unsigned_isprime_slow(unsigned int):
        ...
.L19:
        xorl    %edx, %edx
        movl    %edi, %eax
        divl    %ecx
        testl   %edx, %edx
        je      .L22
.L18:
        addl    $2, %ecx
        cmpl    %esi, %ecx
        jne     .L19
.L20:
        movl    $1, %eax
        ret
       ...
.L22:
        xorl    %eax, %eax
        ret

The inner loops are very similar, same number of instructions, similar instructions. Here are however some potential explanations:

  • cltd extends the sign of the eax register into the edx register, which may be causing an instruction delay because eax is modified by the immediately preceeding instruction movl %edi, %eax. Yet this would make the signed version slower than the unsigned one, not faster.
  • the loops' initial instructions might be misaligned for the unsigned version, but it is unlikely as changing the order in the source code has no effect on the timings.
  • Although the register contents are identical for the signed and unsigned division opcodes, it is possible that the idivl instruction take fewer cycles than the divl instruction. Indeed the signed division operates on one less bit of precision than the unsigned division, but the difference seems quite large for this small change.
  • I suspect more effort was put into the silicon implementation of idivl because signed divisions are more common that unsigned divisions (as measured by years of coding statistics at Intel).
  • as commented by rcgldr, looking at instruction tables for Intel process, for Ivy Bridge, DIV 32 bit takes 10 micro ops, 19 to 27 cycles, IDIV 9 micro ops, 19 to 26 cycles. The benchmark times are consistent with these timings. The extra micro-op may be due to the longer operands in DIV (64/32 bits) as opposed to IDIV (63/31 bits).

This surprising result should teach us a few lessons:

  • optimizing is a difficult art, be humble and procrastinate.
  • correctness is often broken by optimizations.
  • choosing a better algorithm beats optimization by a long shot.
  • always benchmark code, do not trust your instincts.
chqrlie
  • 131,814
  • 10
  • 121
  • 189
  • Why not precompute `sqrt(m)` rounded down (float cast to unsigned int, say, once and use that as the upper limit in the iteration? That's faster than testing `i * i <= m` – Henno Brandsma Aug 31 '17 at 22:16
  • 3
    @HennoBrandsma: it may be faster or not faster, only benchmarking can tell... Multiplication is quite fast on modern CPUs. There is also a precision issue for large values of `num`. Last and not least, I was trying to have very simple functions so the assembly code would stay simple too. – chqrlie Aug 31 '17 at 22:22
  • This answer seems to have the most actual information, but it still doesn't clear things up for me. – Michael Burr Aug 31 '17 at 22:36
  • @MichaelBurr: I suspect more effort was put into the silicon implementation of `idivl` because signed divisions are more common that unsigned divisions (as measured by years of coding statistics at Intel), but I have no proof. – chqrlie Aug 31 '17 at 22:42
  • 2
    Looking at instruction tables for Intel process, for Ivy Bridge, DIV 32 bit takes 10 micro ops, 19 to 27 cycles, IDIV 9 micro ops, 19 to 26 cycles. On WIndows XP (my only 32 bit OS), Intel 3770K 3.5 ghz, Visual Studio, the fast times are 0.048 ms for int, 0.065 ms for unsigned int. – rcgldr Aug 31 '17 at 23:12
14

Because signed integer overflow is undefined, the compiler can make a lot of assumptions and optimizations on code involving signed integers. Unsigned integer overflow is defined to wrap around, so the compiler won't be able to optimize as much. See also http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html#signed_overflow and http://www.airs.com/blog/archives/120.

shadow_map
  • 313
  • 3
  • 15
  • 3
    Your statements are correct, but looking at the code generated by x86 compilers, it does not seem to be a pertinent explanation. – chqrlie Aug 31 '17 at 22:48
  • Your statements looks right but it seems to be answering the reverse question. So wrong at last. – iBug Sep 01 '17 at 00:31
3

From Instruction specification on AMD/Intel we have (for K7):

Instruction Ops Latency Throughput
DIV r32/m32 32  24      23
IDIV r32    81  41      41
IDIV m32    89  41      41 

For i7, latency and throughput are the same for IDIVL and DIVL, a slight difference exists for the µops.

This may explain the difference as -O3 assembly codes only differ by signedness (DIVL vs IDIVL) on my machine.

Jean-Baptiste Yunès
  • 34,548
  • 4
  • 48
  • 69
  • 9
    Hold on, that's the wrong way around: this would make unsigned faster – harold Dec 08 '15 at 20:59
  • But this table says it should be faster for 32bit unsigned, which it wasn't (but of course OP probably doesn't have a K7 anymore) – harold Dec 08 '15 at 21:20
  • 2
    These may be worst case latency, for something like negative operands. Does your measured performance change when operands are negative? – Michael Dec 08 '15 at 22:14
  • the `unsigned` version uses `div` which will have to be faster – phuclv Nov 26 '16 at 17:19
0

Alternative wiki candidate test that may/may not show a significant time difference.

#include <stdio.h>
#include <time.h>

#define J 10
#define I 5

int main(void) {
  clock_t c1,c2,c3;
  for (int j=0; j<J; j++) {
    c1 = clock();
    for (int i=0; i<I; i++) {
      isprime(294967241);
      isprime(294367293);
    }
    c2 = clock();
    for (int i=0; i<I; i++) {
      isunsignedprime(294967241);
      isunsignedprime(294367293);
    }
    c3 = clock();
    printf("%d %d %d\n", (int)(c2-c1), (int)(c3-c2), (int)((c3-c2) - (c2-c1)));
    fflush(stdout);
  }
  return 0;
}

Sample output

2761 2746 -15
2777 2777 0
2761 2745 -16
2793 2808 15
2792 2730 -62
2746 2730 -16
2746 2730 -16
2776 2793 17
2823 2808 -15
2793 2823 30
chux - Reinstate Monica
  • 143,097
  • 13
  • 135
  • 256
-1

In fact in many cases unsigned is faster than signed for eample

  • In dividing by a power of 2
    unsigned int x=37;
    cout<<x/4;
  • In checking if a number if even
    unsigned int x=37;
    cout<<(x%2==0)?"even":"odd";
ishandutta2007
  • 16,676
  • 16
  • 93
  • 129
  • 1
    *"in many cases unsigned is faster than signed*" do you have evidence for this claim? Your examples only show how to use `unsigned int`s, but your answer does not show anything to do with performance and thus does not really answer the question. – Thomas Flinkow Sep 22 '22 at 09:57
  • 1
    As it’s currently written, your answer is unclear. Please [edit] to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Sep 22 '22 at 09:58
  • @ThomasFlinkow For bechmarking check [this](https://stackoverflow.com/a/4712394/20008608) answer – ishandutta2007 Sep 24 '22 at 10:29