why is it faster to print number in binary using arithmetic instead of _bittest

Question

The purpose of the next two code section is to print number in binary.
The first one does this by two instructions (_bittest), while the second does it by pure arithmetic instructions which is three instructions.
the first code section:

#include <intrin.h>
#include <stdio.h>  
#include <Windows.h>

long num = 78002;
int main()
{
    unsigned char bits[32];
    long nBit;
    LARGE_INTEGER a, b, f;
    QueryPerformanceCounter(&a);
    for (size_t i = 0; i < 100000000; i++)
    {
        for (nBit = 0; nBit < 31; nBit++)
        {
            bits[nBit] = _bittest(&num, nBit);
        }
    }
    QueryPerformanceCounter(&b);
    QueryPerformanceFrequency(&f);
    printf_s("time is: %f\n", ((float)b.QuadPart - (float)a.QuadPart) / (float)f.QuadPart);

    printf_s("Binary representation:\n");
    while (nBit--)
    {
        if (bits[nBit])
            printf_s("1");
        else
            printf_s("0");
    }
    return 0;
}

the inner loop is compile to the instructions bt and setb
The second code section:

#include <intrin.h>
#include <stdio.h>  
#include <Windows.h>
long num = 78002;
int main()
{
    unsigned char bits[32];
    long nBit;

    LARGE_INTEGER a, b, f;
    QueryPerformanceCounter(&a);
    for (size_t i = 0; i < 100000000; i++)
    {
        long curBit = 1;
        for (nBit = 0; nBit < 31; nBit++)
        {
            bits[nBit] = (num&curBit) >> nBit;
            curBit <<= 1;
        }
    }
    QueryPerformanceCounter(&b);
    QueryPerformanceFrequency(&f);
    printf_s("time is: %f\n", ((float)b.QuadPart - (float)a.QuadPart) / (float)f.QuadPart);

    printf_s("Binary representation:\n");
    while (nBit--)
    {
        if (bits[nBit])
            printf_s("1");
        else
            printf_s("0");
    }
    return 0;
}

The inner loop compile to and add(as shift left) and sar.
the second code section run three time faster then the first one.

Why three cpu instructions run faster then two?

Why don't you post the relevant assembly rather than describe it? — EOF, Nov 18 '17 at 21:18

Ped7g · Answer 1 · 2017-11-19T01:15:39.617

Not answer (Bo did), but the second inner loop version can be simplified a bit:

    long numCopy = num;
    for (nBit = 0; nBit < 31; nBit++) {
        bits[nBit] = numCopy & 1;
        numCopy >>= 1;
    }

Has subtle difference (1 instruction less) with gcc 7.2 targetting 32b.

_{(I'm assuming 32b target, as you convert long into 32 bit array, which makes sense only on 32b target ... and I assume x86, as it includes <windows.h>, so it's clearly for obsolete OS target, although I think windows now have even 64b version? (I don't care.))}

Answer:

Why three cpu instructions run faster then two?

Because the count of instructions only correlates with performance (usually fewer is better), but the modern x86 CPU is much more complex machine, translating the actual x86 instructions into micro-code before execution, transforming that further by things like out-of-order-execution and register renaming (to break false dependency chains), and then it executes the resulting microcode, with different units of CPU capable to execute only some micro-ops, so in ideal case you may get 2-3 micro-ops executed in parallel by the 2-3 units in single cycle, and in worst case you may be executing an complete micro-code loop implementing some complex x86 instruction taking several cycles to finish, blocking most of the CPU units.

Another factor is availability of data from memory and memory writes, a single cache miss, when the data must be fetched from higher level cache, or even memory itself, creates tens-to-hundreds cycles stall. Having compact data structures favouring predictable access patterns and not exhausting all cache-lines is paramount for exploiting maximum CPU performance.

If you are at stage "why 3 instructions are faster than 2 instructions", you pretty much can start with any x86 optimization article/book, and keep reading for few months or year(s), it's quite complex topic.

You may want to check this answer https://gamedev.stackexchange.com/q/27196 for further reading...

In the x86-64 Windows ABI, `long` is a 32-bit type. Everyone else (System V) uses 64-bit `long`, though. — Peter Cordes, Nov 19 '17 at 04:41
haha, I still don't know whether that gamedev answer I wrote was a good idea or not. I think I put enough caveats and "it's not that simple" warnings around the one-dimensional cost list that it's safe to recommend. — Peter Cordes, Nov 19 '17 at 04:47

score 1 · Answer 2 · answered Nov 19 '17 at 06:42

I'm assuming you're using x86-64 MSVC CL19 (or something that makes similar code).

_bittest is slower because MSVC does a horrible job and keeps the value in memory and bt [mem], reg is much slower than bt reg,reg. This is a compiler missed-optimization. It happens even if you make num a local variable instead of a global, even when the initializer is still constant!

I included some perf analysis for Intel Sandybridge-family CPUs because they're common; you didn't say and yes it matters: bt [mem], reg has one per 3 cycle throughput on Ryzen, one per 5 cycle throughput on Haswell. And other perf characteristics differ...

(For just looking at the asm, it's usually a good idea to make a function with args to get code the compiler can't do constant-propagation on. It can't in this case because it doesn't know if anything modifies num before main runs, because it's not static.)

Your instruction-counting didn't include the whole loop so your counts are wrong, but more importantly you didn't consider the different costs of different instructions. (See Agner Fog's instruction tables and optimization manual.)

This is your whole inner loop with the _bittest intrinsic, with uop counts for Haswell / Skylake:

    for (nBit = 0; nBit < 31; nBit++) {
        bits[nBit] = _bittest(&num, nBit);
        //bits[nBit] = (bool)(num & (1UL << nBit));   // much more efficient
    }

Asm output from MSVC CL19 -Ox on the Godbolt compiler explorer

$LL7@main:
    bt       DWORD PTR num, ebx          ; 10 uops (microcoded), one per 5 cycle throughput
    lea      rcx, QWORD PTR [rcx+1]      ; 1 uop
    setb     al                          ; 1 uop
    inc      ebx                         ; 1 uop
    mov      BYTE PTR [rcx-1], al        ; 1 uop (micro-fused store-address and store-data)
    cmp      ebx, 31
    jb       SHORT $LL7@main             ; 1 uop (macro-fused with cmp)

That's 15 fused-domain uops, so it can issue (at 4 per clock) in 3.75 cycles. But that's not the bottleneck: Agner Fog's testing found that bt [mem], reg has a throughput of one per 5 clock cycles.

IDK why it's 3x slower than your other loop. Maybe the other ALU instructions compete for the same port as the bt, or the data dependency it's part of causes a problem, or just being a micro-coded instruction is a problem, or maybe the outer loop is less efficient?

Anyway, using bt [mem], reg instead of bt reg, reg is a major missed optimization. This loop would have been faster than your other loop with a 1 uop, 1c latency, 2-per-clock throughput bt r9d, ebx.

The inner loop compile to and add(as shift left) and sar.

Huh? Those are the instructions MSVC associates with the curBit <<= 1; source line (even though that line is fully implemented by the add self,self, and the variable-count arithmetic right shift is part of a different line.)

But the whole loop is this clunky mess:

    long curBit = 1;
    for (nBit = 0; nBit < 31; nBit++)  {
        bits[nBit] = (num&curBit) >> nBit;
        curBit <<= 1;
    }

$LL18@main:               # MSVC CL19  -Ox
    mov      ecx, ebx                  ; 1 uop
    lea      r8, QWORD PTR [r8+1]      ; 1 uop   pointer-increment for bits
    mov      eax, r9d                  ; 1 uop.  r9d holds num
    inc      ebx                       ; 1 uop
    and      eax, edx                  ; 1 uop
       # MSVC says all the rest of these instructions are from             curBit <<= 1; but they're obviously not.
    add      edx, edx                  ; 1 uop
    sar      eax, cl                   ; 3 uops (variable-count shifts suck)
    mov      BYTE PTR [r8-1], al       ; 1 uop (micro-fused)
    cmp      ebx, 31
    jb       SHORT $LL18@main         ; 1 uop (macro-fused with cmp)

So this is 11 fused-domain uops, and takes 2.75 clock cycles per iteration to issue from the front-end.

I don't see any loop-carried dep chains longer than that front-end bottleneck, so it probably runs about that fast.

Copying ebx to ecx every iteration instead of just using ecx as the loop counter (nBit) is an obvious missed optimization. The shift-count is needed in cl for a variable-count shift (unless you enable BMI2 instructions, if MSVC can even do that.)

There are major missed optimizations here (in the "fast" version), so you should probably write your source differently do hand-hold your compiler into making less bad code. It implements this fairly literally instead of transforming it into something the CPU can do efficiently, or using bt reg,reg / setc

How to do this fast in asm or with intrinsics

Use SSE2 / AVX. Get the right byte (containing the corresponding bit) into each byte element of a vector, and PANDN (to invert your vector) with a mask that has the right bit for that element. PCMPEQB against zero. That gives you 0 / -1. To get ASCII digits, use _mm_sub_epi8(set1('0'), mask) to subtract 0 or -1 (add 0 or 1) to ASCII '0', conditionally turning it into '1'.

The first steps of this (getting a vector of 0/-1 from a bitmask) is How to perform the inverse of _mm256_movemask_epi8 (VPMOVMSKB)?.

Fastest way to unpack 32 bits to a 32 byte SIMD vector (has a 128b version). Without SSSE3 (pshufb), I think punpcklbw / punpcklwd (and maybe pshufd) is what you need to repeat each byte of num 8 times and make two 16-byte vectors.
is there an inverse instruction to the movemask instruction in intel avx2?.

In scalar code, this is one way that runs at 1 bit->byte per clock. There are probably ways to do better without using SSE2 (storing multiple bytes at once to get around the 1 store per clock bottleneck that exists on all current CPUs), but why bother? Just use SSE2.

  mov    eax, [num]
  lea    rdi, [rsp + xxx]  ; bits[]
.loop:
    shr   eax, 1     ; constant-count shift is efficient (1 uop).  CF = last bit shifted out
    setc  [rdi]      ; 2 uops, but just as efficient as setc reg / mov [mem], reg

    shr   eax, 1
    setc  [rdi+1]

    add   rdi, 2
    cmp   end_pointer    ; compare against another register instead of a separate counter.
    jb   .loop

Unrolled by two to avoid bottlenecking on the front-end, so this can run at 1 bit per clock.

Bo Persson · Answer 3 · 2017-11-19T11:51:34.950

0

The difference is that the code _bittest(&num, nBit); uses a pointer to num, which makes the compiler store it in memory. And the memory access makes the code a lot slower.

        bits[nBit] = _bittest(&num, nBit);
00007FF6D25110A0  bt          dword ptr [num (07FF6D2513034h)],ebx     ; <-----
00007FF6D25110A7  lea         rcx,[rcx+1]  
00007FF6D25110AB  setb        al  
00007FF6D25110AE  inc         ebx  
00007FF6D25110B0  mov         byte ptr [rcx-1],al

The other version stores all the variables in registers, and uses very fast register shifts and adds. No memory accesses.

edited Nov 19 '17 at 11:51

answered Nov 19 '17 at 00:38

Bo Persson

90,663
31
146
203

"and therefore forces"... no, that's just a missed optimization by whatever compiler you used. `bt reg,reg` is fast (single-uop with 1c latency, one per 0.5 clocks on Skylake), `bt [mem], reg` is slow (like 10 uops on Skylake, one per 5 clocks). – Peter Cordes Nov 19 '17 at 04:39
Ok, "forces" might have been a bit strong, but the use of `&num` is what makes the compiler give `num` a memory address. Otherwise I have seen code equivalent of `num & (1 << nBit)` where the compiler generates `bt` on register only variables. We might learn here that use of `_bittest` might be not only premature but also not really an optimization. – Bo Persson Nov 19 '17 at 12:11
My guess is that MSVC take the `_bittest()` intrinsic literally, and always compiles it to a `bt [mem]`, at least when the 2nd arg isn't a constant that it can check is less than the width of the operand. This is typically slower than masking/shifting to split the register value into an address and a mask for use with `test [mem], reg` to set ZF instead of CF. – Peter Cordes Nov 19 '17 at 15:38
And BTW, it's not that memory access is slow: `test [mem], reg` would have been fine (a waste of a load uop, but no extra front-end cost). It's `bt [mem], reg` specifically that's slow. Re-reading something that's hot in L1D cache is very cheap. – Peter Cordes Nov 19 '17 at 15:40

why is it faster to print number in binary using arithmetic instead of _bittest

3 Answers3

How to do this fast in asm or with intrinsics

Linked