3

I have a performance problem: I can't beat the speed of the compiler-generated code in a release build at all. My version is 25% slower. The function I wrote is called around 20 million times in my test, so making it run faster will pay off.

The code in C++ is pretty simple:

static inline char GetBit(char *data, size_t bit)
{ 
    return 0 != (data[bit / 8] & (1 << (bit % 8))); 
}

And this is the version I wrote for 64bit MASM:

mov   rax, rdx            ; rax = bit
mov   r10, 8h
xor   rdx, rdx            ; clear the high half for the divide
div   r10                 ; rax = bit / 8, rdx = bit % 8
mov   al, byte ptr [rax+rcx]  ; load data[bit / 8]
mov   bl, 1h
mov   cl, dl
shl   bl, cl              ; bl = 1 << (bit % 8)
and   al, bl
shr   al, cl              ; move the tested bit down to bit 0
ret 

Well, I'm not much of an assembler guy, but I don't think the compiler can make 25% faster code just by creating better assembly. So the trick is probably in the function call. The compiler respects the inline keyword for the C++ code and generates no call, but I just can't make it work for the asm code:

extern "C" inline char GetBitAsm(char *data, size_t bit);

I've disassembled the code using dumpbin and I can clearly see my code plus a function call, while no call is generated for the compiler's version:

mov   rdx, qword ptr [bit]  
mov   rcx, qword ptr [data]  
call  GetBitAsm (013F588EFDh)  
mov   byte ptr [isbit], al  

There are additionally two reads and one write to memory, while the compiler-generated code probably does only one read. I read somewhere that the div opcode takes around 20 cycles, while a single memory access costs at least 100 cycles. So removing the mov rdx and mov rcx loads from memory and replacing them with values already in registers in the parent function will, I think, do the trick.

Questions:

  1. Is that really the reason it runs so slowly?

  2. How can I make a function written in asm inline in a release build?

  3. How can I further improve my assembly code to make it even faster?

  • 11
    Have you actually looked at the disassembly of the pure C version? I wouldn't find it surprising if the compiler's code is significantly faster. Perhaps inlining plays a role (you can test this by putting `GetBit` in a different translation unit) but it's probably not the sole cause. For starters you can replace the `div` with a right shift (by 3) and a bitwise and (by 0x7), and the compiler certainly does that. And once you're down to a couple 1-cycle bit shuffling instructions, scheduling and instruction selection can make all the difference, and compilers are pretty good at that. –  Jul 13 '14 at 13:30
  • 4
    Also note that there is a specialized instruction called `BT` for exactly this purpose. – Jester Jul 13 '14 at 13:40
  • 10
    IMO, if you're "not much of an assembler guy", assuming that you'll just sit down and write code that outperforms many person-years of experience is a bit much. Compilers haven't produced code as slow as your attempt for decades. – molbdnilo Jul 13 '14 at 13:56
  • 1
    Removing the div opcode proved to be a good idea. Now my code runs as fast as the compiler's code. – user3834213 Jul 13 '14 at 14:00
  • Depending on what that parent function is doing, you may have a lot more luck optimizing that. You mention caching something in registers there, which would do something (likely not a whole lot), but it implies that you intend to read several bits from the same region and you may be able to combine getting several bits into getting a whole range of bits at once. For example if you're scanning for the first set bit or something like that, that could be significantly improved. – harold Jul 13 '14 at 14:15
  • 1
    If using MSVC, use the `_bittest` intrinsic. If using gcc, [use inline asm to generate the `bt` instruction](http://stackoverflow.com/a/1983525/902497). – Raymond Chen Jul 13 '14 at 15:28
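
For reference, a minimal sketch of the `_bittest` / `bt` approach mentioned in the comments above, assuming MSVC and `<intrin.h>`; the `GetBitBt` name and the cast are illustrative only, and the usual alignment and aliasing caveats apply:

#include <cstddef>
#include <intrin.h>   // MSVC-specific: declares _bittest

// Sketch only: the intrinsic lets the compiler emit a BT instruction
// instead of an explicit shift/mask sequence. The cast assumes the
// buffer can legally be read as an array of long.
static inline char GetBitBt(const char *data, size_t bit)
{
    return _bittest(reinterpret_cast<const long *>(data),
                    static_cast<long>(bit));
}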

3 Answers

5

The function call overhead and the DIV instruction in your assembly code are going to kill performance relative to any compiler's inlined code. The function call overhead alone may be greater than the number of cycles the compiler's code takes on average, and the DIV instruction could be several times greater than that.

Memory accesses are often free on modern processors because they can be satisfied from the processor's caches. In your assembly version the memory accesses would cost close to 0 cycles on average, because your code is probably slow enough that the processor can easily prefetch memory into its caches before it needs it. The compiler's code, on the other hand, is potentially fast enough that it could be consuming values from memory faster than the processor can fetch them, so it would have to stall periodically waiting for the fetches to complete. So while memory access cycle times are going to be higher on average in the compiler's code, that's only because it's much better optimized.

The best solution to your problem is to let the compiler do the optimization. To be quite frank, it seems it knows how to generate better code than you do. Even an assembly expert would have a hard time improving on the compiler, and it would require looking at the problem at a wider scope than just instruction selection of this one function.

If you'd still rather use your own assembly code, then use your compiler's inline assembly functionality and get rid of the DIV instruction. It still won't perform as well as the compiler's version, but that should get it a lot closer.

Ross Ridge
  • I got rid of the div, which seemed to be the biggest problem. Now it runs as fast as the compiler's version: ` mov rax, rdx shr rax, 3h and dx, 7h bt word ptr [rcx+rax], dx lahf mov al, ah and al, 1h ret` – user3834213 Jul 13 '14 at 14:35
  • 1
    @user3834213 you might also consider using `setc al` – harold Jul 13 '14 at 14:40
  • You've probably hit the limit of how fast your processor can access memory with both versions. Are you measuring the performance of your code in a realistic situation where you're actually processing the bits as you intend to or are you only measuring the performance of the `GetBit` function in isolation? – Ross Ridge Jul 13 '14 at 14:49
  • @RossRidge: It is a real-world industrial sample. A set of millions of node or element ids (as integers) is returned from an FEA solver run (like Nastran, Ansys, etc.). They are stored in binary tree structures, and then intersection, union, etc. operations are applied. I'm looking for a way to decrease the run time by optimizing our code; it may be several hours in some cases. – user3834213 Jul 13 '14 at 15:10
  • If you're accessing the bits sequentially you can try creating a streaming bit reader that only reads each byte once from the data array. If your data is aligned then reading 64-bit values at a time could also improve performance (a sketch of this idea follows these comments). – Ross Ridge Jul 13 '14 at 15:26
  • *" Memory accesses are often free on modern processors because they can be satisfied from the processor's caches.*" Hugh, be careful with sentences like that. Memory access performance depends a lot on the access pattern (Because the caches, as you said). In fact memory access is **one of the most important performance bottlenecks on modern architectures**, in addition to branch prediction issues. Mem access is near to zero when its done in the right way, but when it isn't.... Like in OOP. Wait for ~500 cicles until the data is in your hands – Manu343726 Jul 13 '14 at 17:45
  • Hugh? Are you talking to me? – Ross Ridge Jul 13 '14 at 21:10
  • @Ross sorry, was an onomatopoeia, not a name ;). My fault. – Manu343726 Jul 13 '14 at 21:13
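
Following up on the streaming-reader suggestion in the comments above, here is a minimal sketch, assuming the bits are visited strictly in order and the buffer length is a multiple of 64 bits; the `BitStream` name and its members are illustrative only:

#include <cstddef>
#include <cstdint>

// Sketch of a sequential bit reader: each 64-bit word is loaded from
// memory once and then consumed bit by bit, instead of re-reading the
// array for every single bit test.
struct BitStream {
    const std::uint64_t *words;
    std::uint64_t        current = 0;
    std::size_t          index   = 0;  // next word to load
    unsigned             left    = 0;  // bits remaining in 'current'

    explicit BitStream(const std::uint64_t *data) : words(data) {}

    bool NextBit() {
        if (left == 0) {                // refill once per 64 bits
            current = words[index++];
            left = 64;
        }
        bool bit = (current & 1u) != 0;
        current >>= 1;
        --left;
        return bit;
    }
};

Each cache line is then touched only once per pass, which is what matters when the surrounding code scans millions of bits.
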
1

I'll take a long shot here and speculate a bit on what you are trying to do, so bear with me:

There are a few things that strike me about your code (both the C++ and the assembly). The first is, as has been mentioned by others, that you use div and mod. These operations are rather slow, and one of the reasons you will not be able to compete with your compiler is that, most likely, it will optimize these operations away.

You are working with powers of 2, and the computer was made for working with powers of 2. That means that this is equivalent to your code:

static inline char GetBit(char *data, size_t bit)
{ 
    return 0 != (data[bit >> 3] & (1 << (bit & 0x07))); 
}

You can use this to improve your assembly, but it is not likely to give you much of a performance improvement.

If, on the other hand, you are aiming to speed up your code, I suggest the following changes:

  1. In your large bitmask, change the base type to your processor's native size, that is, uint32_t on 32-bit machines and uint64_t on 64-bit machines.

  2. Also, divide your getBit() function into two functions, getWord() and getBit().

    getWord() should be something along the lines of:

    static inline uint32_t getWord(const uint32_t *data, size_t bit) { 
        return data[ bit / (sizeof(*data)*8) ]; // Note the parentheses: the
                                                // divisor is the word size in
                                                // bits. The compiler will most
                                                // likely turn this division by
                                                // a power of 2 into a shift.
                                                // Check to be certain.
    }
    
    static inline uint32_t getBit(const uint32_t *data, size_t bit) { 
         return getWord(data, bit) & (1 << (bit & (sizeof(*data)*8 - 1))); 
         // Or just % like above, check which is faster. 
    }
    
  3. The real speed-up should come when you rewrite the code that uses this bitmask:

    1. If you are jumping around a lot in your buffer, you will probably only get a slight improvement from the suggestions above.

    2. If, however, you are iterating through the data in a linear fashion, I suggest changing your code to this:

       uint32_t mask = 1;
       uint32_t word;
       for ( int bit = 0; bit < 2048; bit++) {
           word = getWord(buffer, bit); // You could also move this outside a smaller loop, but I'm not sure it's worth it.
           if (word & mask) {
               cout << "Bit " << bit << " is set." << endl;
           }
      
           // Most modern compilers will recognize the following as a ROL 
           // (ROtate Left) and replace it with one instruction.
           mask = (mask << 1 | mask >> (sizeof(mask)*8-1)); 
       }
      

    The reason this is a good idea is that the processor is optimized to work with native-size integers: you avoid problems with alignment, widening values in registers, etc. You may also notice that by carrying the mask across loop iterations you avoid an extra shift/division, as we simply let the mask roll around when it wraps.

Stian Svedenborg
0

In addition to all the things already said, you also have to take the effect of "inline" into account:

You may try to remove the "inline" from the (pure C/C++) function and move the function to another C file so you are sure that the compiler does not inline the function. You will see that the function will run much slower.

The reason: When a function is "inline" the compiler may optimize a lot. When a function is not "inline" the compiler has to store the function arguments on the stack (using "push") and perform a "call" instruction. This will cost a lot of time and make the code much slower than an "inline" function.

For small pieces of code the time required by these operations is much more than the time you may save by using assembler code!
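
A minimal sketch of that experiment, using hypothetical file names; note that link-time code generation (MSVC's /LTCG, or LTO elsewhere) can still inline across translation units, so it should be left off for the test:

// getbit.h -- declaration only, visible to the callers
#include <cstddef>
char GetBitNoInline(const char *data, std::size_t bit);

// getbit.cpp -- separate translation unit: without link-time code
// generation the compiler cannot inline this body into other files'
// call sites, so every call pays argument setup plus call/ret.
#include "getbit.h"

char GetBitNoInline(const char *data, std::size_t bit)
{
    return 0 != (data[bit >> 3] & (1 << (bit & 7)));
}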

Martin Rosenau
  • 1
    Note that the "move it to a different file" step is important. If the definition is visible, compilers can and do inline without `inline` being present. –  Jul 13 '14 at 15:16
  • On x86-64, the new ABIs make function calls much cheaper, as you do *not* pass most arguments on the stack. – EOF Jul 13 '14 at 20:58