
The two code blocks below both calculate the Hamming distance between two integers. They are functionally the same, so why is the first one faster than the second on LeetCode?

Fast:

int hammingDistance(int x, int y) {
    int count = 0;
    for (int i = 0; i < 32; ++i)
    {
        count += (x&1) ^ (y&1);
        x >>= 1;
        y >>= 1;
    }
    return count;
}

Slow:

int hammingDistance(int x, int y) {
    int n = x ^ y;
    int count = 0;
    for (int i = 0; i < 32; ++i)
    {
        count += n & 1;
        n >>= 1;
    }
    return count;
}

**** Update ****

I have written some test code on my Mac:

#include <time.h>
#include "cstdio"

int hamm_fast(int x, int y) {
    int count = 0;
    for (int i = 0; i < 32; ++i)
    {
        count += (x&1) ^ (y&1);
        x >>= 1;
        y >>= 1;
    }
    return count;
}

int hamm_slow(int x, int y) {
    int n = x ^ y;
    int count = 0;
    for (int i = 0; i < 32; ++i)
    {
        count += n & 1;
        n >>= 1;
    }
    return count;
}

int main()
{
    clock_t begin;
    clock_t end;
    double time_spent;

    // benchmark fast
    begin = clock();
    for (int i = 0; i < 100000; ++i)
        hamm_fast(100,100);
    end = clock();
    time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
    printf("Fast: %f ms\n", time_spent*1e3);

    // benchmark slow
    begin = clock();
    for (int i = 0; i < 100000; ++i)
        hamm_slow(100,100);
    end = clock();
    time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
    printf("Slow: %f ms\n", time_spent*1e3);

    return 0;
}

I compile and run it as

g++ ham.cpp && ./a.out

The two solutions turn out to perform similarly. For example:

Fast: 7.233000 ms
Slow: 6.963000 ms

Actually the slow one is faster ...

WDC
  • What is the difference in speed? – Robert Harvey Aug 10 '18 at 17:44
  • 8
    Look at the disassembly. That's the most obvious way. – Eugene Sh. Aug 10 '18 at 17:44
  • The fast one is 0 ~ 1 ms. The slow one is 4 ~ 5 ms. – WDC Aug 10 '18 at 17:45
  • That kind of difference can be accounted for by the fact that there are more operations in the latter example, or how fast the wind is blowing outside. You will get a better benchmark result if you put the whole thing in a loop, run it 10,000 times and measure that. – Robert Harvey Aug 10 '18 at 17:46
  • @RobertHarvey I don't believe a single run of this function will be of ms order. I hope the OP does mean the looped benchmark. – Eugene Sh. Aug 10 '18 at 17:47
  • @EugeneSh. There's probably all sorts of startup overhead in those two times cited. – Robert Harvey Aug 10 '18 at 17:48
  • 3
    The difference demonstrated in updated code is negligible. – Eugene Sh. Aug 10 '18 at 17:57
  • I have updated the problem with a test case. Thanks for your help! – WDC Aug 10 '18 at 17:57
  • You should at least enable optimizations (`g++ -O2 ham.cpp`). – melpomene Aug 10 '18 at 18:10
  • @melpomene Yeah. But then it will be super fast, as `Fast: 0.002000 ms Slow: 0.001000 ms` – WDC Aug 10 '18 at 18:12
  • Yes, as it should be. – melpomene Aug 10 '18 at 18:12
  • 2
    Somehow your fast became slower than slow. – Eugene Sh. Aug 10 '18 at 18:14
  • @EugeneSh. And this is why it's a bad idea to try and second guess how fast something will go. Benchmarking is the only way to be sure. – dgnuff Aug 10 '18 at 18:25
  • 1
    I don't see any reason in this exercise anyway. I don't believe this function is a bottleneck to something. The second variant is more elegant if you ask me, and I would pick it. If you want to optimize it a bit, you can use some intrinsic `popcnt`. – Eugene Sh. Aug 10 '18 at 18:28
  • @EugeneSh. Thank you. Could you post it as an answer so I will accept it. Thanks again! – WDC Aug 10 '18 at 18:32
  • Do you get the same results if you test "slow" before "fast"? – Tim Randall Aug 10 '18 at 19:13
  • Yes, when `g++ ham.cpp` – WDC Aug 10 '18 at 19:14
  • 1
    If you want to understand how your code would translate straight into assembly you should set -O0. That will show that slow will probably be faster because in the faster you are doing there a xor and two shift inside a loop: so all time, while you have the xor outside plus a single shift inside in the slow one. When you start enabling the optimizations, that becomes more unpredictable, as the compiler will try to understand what to do...I suggest that you use g++ -Wa,-adhnl -g speed.c to see mixed c and assembly code to see what the compiler is doing. – Jekyll Aug 10 '18 at 19:32
  • Beware: an aggressive optimizer could remove the test loops because you pass the same values to the Hamming distance functions on each iteration and ignore the returned value — which would leave you with nothing to measure. If it were my testing, I'd have a test function containing the timing and function calls. The function call would be inside two loops, and I'd pass the two loop indexes as arguments to the function, and sum the return values, and print the sum at the end to help ensure that the functions produce the same results. I'd also make it run for seconds, not milliseconds. – Jonathan Leffler Aug 10 '18 at 22:02
  • 'Tis curious this is tagged with C but you have `#include "cstdio"`. I suppose that the 'header' `cstdio` contains `#include <stdio.h>` — but it looks as if you were thinking of coding in C++ with that name. – Jonathan Leffler Aug 10 '18 at 22:05
  • 2
    Both of your ways are slower than they need to be. https://stackoverflow.com/questions/109023/how-to-count-the-number-of-set-bits-in-a-32-bit-integer – Retired Ninja Aug 10 '18 at 22:52

1 Answer


As I noted in a comment, you need to be careful with the benchmarking to ensure that an aggressive optimizer cannot optimize the function calls away.

An aggressive optimizer could remove the test loops because you pass the same values to the Hamming distance functions on each iteration and ignore the returned value — which would leave you with nothing to measure.

If it were my testing, I'd have a test function containing the timing and function calls. The function call would be inside two loops, and I'd pass the two loop indexes as arguments to the function, and sum the return values, and print the sum at the end to help ensure that the functions produce the same results. I'd also make it run for seconds, not milliseconds.

Here's my code. It uses timing code available in my SOQ (Stack Overflow Questions) repository on GitHub as files timer.c and timer.h in the src/libsoq sub-directory.

#include "timer.h"
#include <stdio.h>

#define L1_MIN 0
#define L1_MAX 10240
#define L2_MIN 0
#define L2_MAX 10240

static int hamm_fast(int x, int y)
{
    int count = 0;
    for (int i = 0; i < 32; ++i)
    {
        count += (x & 1) ^ (y & 1);
        x >>= 1;
        y >>= 1;
    }
    return count;
}

static int hamm_slow(int x, int y)
{
    int n = x ^ y;
    int count = 0;
    for (int i = 0; i < 32; ++i)
    {
        count += n & 1;
        n >>= 1;
    }
    return count;
}

static void tester(const char *tag, int (*function)(int x, int y))
{
    Clock t;
    clk_init(&t);

    clk_start(&t);
    int sum = 0;
    for (int i = L1_MIN; i < L1_MAX; i++)
    {
        for (int j = L2_MIN; j < L2_MAX; j++)
            sum += (*function)(i, j);
    }
    clk_stop(&t);

    char buffer[32];
    int iterations = (L1_MAX - L1_MIN) * (L2_MAX - L2_MIN);
    printf("%s sum = %d (%d iterations) %s\n", tag, sum, iterations,
            clk_elapsed_us(&t, buffer, sizeof(buffer)));
}

int main(void)
{
    for (int i = 0; i < 10; i++)
    {
        tester("Fast", hamm_fast);
        tester("Slow", hamm_slow);
    }
    return 0;
}

The output I got on one run was:

Fast sum = 710934528 (104857600 iterations) 2.461100
Slow sum = 710934528 (104857600 iterations) 1.181584
Fast sum = 710934528 (104857600 iterations) 2.480401
Slow sum = 710934528 (104857600 iterations) 1.182961
Fast sum = 710934528 (104857600 iterations) 2.466685
Slow sum = 710934528 (104857600 iterations) 1.197394
Fast sum = 710934528 (104857600 iterations) 2.435806
Slow sum = 710934528 (104857600 iterations) 1.175533
Fast sum = 710934528 (104857600 iterations) 2.384162
Slow sum = 710934528 (104857600 iterations) 1.184161
Fast sum = 710934528 (104857600 iterations) 2.376042
Slow sum = 710934528 (104857600 iterations) 1.191555
Fast sum = 710934528 (104857600 iterations) 2.389027
Slow sum = 710934528 (104857600 iterations) 1.169186
Fast sum = 710934528 (104857600 iterations) 2.393707
Slow sum = 710934528 (104857600 iterations) 1.209600
Fast sum = 710934528 (104857600 iterations) 2.423526
Slow sum = 710934528 (104857600 iterations) 1.204585
Fast sum = 710934528 (104857600 iterations) 2.515968
Slow sum = 710934528 (104857600 iterations) 1.196783

As you can see, the 'fast' code is about twice as slow as the 'slow' code, primarily because it does more work per iteration: two `&` operations, one `^`, and two `>>=` operations, compared with one `&` and one `>>=` in the 'slow' code. The good news is that the sums match: the two functions are equivalent in result, but not in speed.

Compilation on a MacBook Pro running macOS 10.13.6 High Sierra, using GCC 8.2.0.

Compilation command line (source file spot79.c):

$ gcc -O3 -g -I./inc -std=c11 -Wall -Wextra -Werror -Wmissing-prototypes \
>     -Wstrict-prototypes spot79.c -o spot79 -L./lib -lsoq
$

The timer.h header was in the ./inc directory and the soq library was in ./lib — that's simply my build setup.

Jonathan Leffler