6

I made a function to set or clear a specific number of bits in a DWORD. My function works. I don't need help making it work. However, I am wondering if the method I've chosen to do it is the fastest possible way.

It's rather hard for me to explain how this works. There are two arrays containing DWORDs that are filled with bits on the left and right side of the DWORD (with all binary 1's). It makes a mask with all the bits filled except for the ones I want to set or clear, and then sets them with bitwise operators based on that mask. It seems rather complicated for such a simple task, but it seems like the fastest way I could come up with. It's much faster than setting them bit by bit.

static DWORD __dwFilledBitsRight[] = {
        0x0, 0x1, 0x3, 0x7, 0xF, 0x1F, 0x3F, 0x7F, 0xFF, 0x1FF, 0x3FF, 0x7FF, 0xFFF, 0x1FFF, 0x3FFF,    0x7FFF, 0xFFFF, 0x1FFFF, 0x3FFFF, 0x7FFFF, 0xFFFFF, 0x1FFFFF, 0x3FFFFF, 0x7FFFFF, 0xFFFFFF, 0x1FFFFFF, 0x3FFFFFF, 0x7FFFFFF, 0xFFFFFFF, 0x1FFFFFFF, 0x3FFFFFFF, 0x7FFFFFFF, 0xFFFFFFFF
    };

static DWORD __dwFilledBitsLeft[] = {
        0x0, 0x80000000, 0xC0000000, 0xE0000000, 0xF0000000, 0xF8000000, 0xFC000000, 0xFE000000, 0xFF000000, 0xFF800000, 0xFFC00000, 0xFFE00000, 0xFFF00000, 0xFFF80000, 0xFFFC0000, 0xFFFE0000,    0xFFFF0000, 0xFFFF8000, 0xFFFFC000, 0xFFFFE000, 0xFFFFF000, 0xFFFFF800, 0xFFFFFC00, 0xFFFFFE00, 0xFFFFFF00, 0xFFFFFF80, 0xFFFFFFC0, 0xFFFFFFE0, 
        0xFFFFFFF0, 0xFFFFFFF8, 0xFFFFFFFC, 0xFFFFFFFE, 0xFFFFFFFF
    };

    // nStartBitFromLeft must be between 1 and 32... 
    // 1 is the bit farthest to the left (actual bit 31)
    // 32 is the bit farthest to the right (actual bit 0)
    inline void __FillDWORDBits(DWORD *p, int nStartBitFromLeft, int nBits, BOOL bSet)
    {
        DWORD dwLeftMask = __dwFilledBitsLeft[nStartBitFromLeft - 1]; // Mask for data on the left of the bits we want
        DWORD dwRightMask = __dwFilledBitsRight[33 - (nStartBitFromLeft + nBits)]; // Mask for data on the right of the bits we want
        DWORD dwBitMask = ~(dwLeftMask | dwRightMask); // Mask for the bits we want
        DWORD dwOriginal = *p;
        if(bSet) *p = (dwOriginal & dwLeftMask) | (dwOriginal & dwRightMask) | (0xFFFFFFFF & dwBitMask);
        else *p = (dwOriginal & dwLeftMask) | (dwOriginal & dwRightMask) | 0;

    }
oldSkool
  • 1,212
  • 5
  • 14
  • 29
  • That's what I was thinking... but my target platform is i386+ – oldSkool Jan 26 '11 at 00:05
  • personally when I feel something is hard to explain to somebody else I take that as a warning sign. :-) You probably should benchmark your code and compare with setting bit by bit if you haven't already. Have you tried using a union instead? – AndersK Jan 26 '11 at 00:06
  • yeah, I don't trust my benchmarks using the clock ticks in windows... but based on the number of instructions, setting bit by bit had more instructions over the complete iteration. – oldSkool Jan 26 '11 at 00:14
  • As a side-note, your `__Fi...` name gives undefined behavior (names starting with an underscore followed by either another underscore or a capital letter are reserved for the implementation). Personally, I'd separate this into two functions, `fill_bits` and `clear_bits`. Passing a `bool` as a parameter is usually a mistake -- if you insist on this structure, use something like `change_bits` and pass an `enum { CLEAR, FILL};`. as the parameter -- but separate functions would be better, IMO. – Jerry Coffin Jan 26 '11 at 00:19
  • Interesting... I usually use two underscores for functions like these so I can find them faster in my function list... I'm trying to grasp what you're saying here... I don't like the unix style naming conventions (i.e. lower_lower()). Personally I think they make my programs less readable and more cryptic. – oldSkool Jan 26 '11 at 00:24
  • looks correct and must run faster on any i386 like processor – Mandrake Jan 29 '11 at 12:26

2 Answers2

12

How about:

// Create mask of correct length, and shift to the correct position
DWORD mask = ((1ULL << nBits) - 1) << pos;
// Apply mask (or its inverse)
if (bSet)
{
    *p |= mask;
}
else
{
    *p &= ~mask;
}

It's pretty likely that simple bitwise operations will be faster than table lookup on any modern processor.

Note: Depending on the relationship between DWORD and long long on this platform, you may need special handling for the case where nBits == sizeof(DWORD)*8. Or if nBits==0 is not a possibility, you could just do DWORD mask = ((2ULL << (nBits - 1)) - 1) << pos;.

Update: It's been mentioned that the if could potentially be slow, which is true. Here's a replacement for it, but you'd need to measure to see if it's actually any faster in practice.

// A bit hacky, but the aim is to get 0x00000000 or 0xFFFFFFFF
// (relies on two's-complement representation)
DWORD blanket = bSet - 1;
// Use the blanket to override one or other masking operation
*p |=  (blanket | mask);
*p &= ~(blanket & mask);
Oliver Charlesworth
  • 267,707
  • 33
  • 569
  • 680
  • You should be careful that `1L << nBits` is undefined is `nBits` is equal to the length of the word, 32 in this case. If 32 (or 64) is a valid parameter, a special case must be written. – Giuseppe Ottaviano Jan 26 '11 at 00:08
  • The `if` would be slow. It would be nice to avoid it. – ruslik Jan 26 '11 at 00:09
  • Awesome... that is the code that I was thinking of in my head to begin with, but I just couldn't code it, so I ended up with that lookup table monster... thanks a bunch dude :) – oldSkool Jan 26 '11 at 00:10
  • @Giuseppe: Indeed, if `sizeof(long) <= sizeof(DWORD)`. It just so happens I've already updated my answer to mention this! – Oliver Charlesworth Jan 26 '11 at 00:10
  • @PaulB: It's not just whether the assembler is smaller. It's also the memory-access time for performing lookups. – Oliver Charlesworth Jan 26 '11 at 00:18
  • @ruslisk: It's not clear if that if-statement would actually slow things down, especially on modern pipelined CPUs. Those suggested bit manipulation tricks outlined must all be executed in order (since the results of each operation are passed to the next one), whereas in a simple conditional jump both paths may be executed in parallel and one result being discarded. Some compilers, like GNU provide an extension to specify, which path is the un-/likely one to be taken. My gut tells me, that on a modern CPU the if-variant is likely to be the more performant one. – datenwolf Jan 26 '11 at 00:43
  • @datenwolf: At some point, though, a conditional operation must occur, in order to populate the result with one or other intermediate result. Thus a conditional branch must occur. – Oliver Charlesworth Jan 26 '11 at 00:46
  • Yeah, this was the problem I had before when trying this algorithm. The number of bits in "mask" needs to be calculated before shifting. For example, let's say you wanted to fill 4 bits starting at bit 32, it would make a mask that is 4 bits and shift it 32 left, which would cut off most of the mask and put it into the wrong position. I could use a BSF instruction to count the leading bits, but I'm not sure how to implement it yet; i'm working on it now – oldSkool Jan 26 '11 at 00:55
  • @old: "4 bits starting at bit 32"? But the highest bit is 31, no? – Oliver Charlesworth Jan 26 '11 at 09:39
  • My mistake, but the same concept applies... if you shift 4 bits left 31 (the bit position), it will cut off 3 bits. – oldSkool Jan 26 '11 at 12:19
  • @old: well, yes! It doesn't make sense to set 4 bits starting at bit 31! – Oliver Charlesworth Jan 26 '11 at 12:22
  • I need to make a mask of n bits, and put the leftmost bit at bit b, not the rightmost bit – oldSkool Jan 26 '11 at 12:23
  • @old: Well, when setting up `mask`, just shift by `pos-nBits-1` rather than by `pos`. – Oliver Charlesworth Jan 26 '11 at 12:25
  • I will try that out... thanks... I finally settled on making a FFFFFFFF mask and shifting right, then flipping, but I will try out your algorithm because it seems simpler – oldSkool Jan 26 '11 at 12:27
  • Are lookup tables really that slow? I thought that wasn't an issue any more with the newer processors. – oldSkool Jan 26 '11 at 12:32
  • @old: It depends on how the access pattern interacts with the cache. It could potentially be very fast, or very slow. – Oliver Charlesworth Jan 26 '11 at 12:39
  • @Oli Checking the assembly isn't just for the size. It's also to validate the assumptions about loads and stores. Optimizers will often produce surprising results – Paul Beusterien Jan 26 '11 at 15:39
4

This is the way I'd do it. I'd break it into two functions, setbits() and clearbits(). Steps broken out for clarity, and I'm sure it can be far more optimized.

This version is dependent on 32-bit code as it stands. Also, in my world, bit 0 is the rightmost bit. Your mileage may vary.

setbits( DWORD *p , int offset , int len )
{
  // offset must be 0-31, len must be 0-31, len+offset must be 0-32
  int   right_shift = ( !len ? 0 : 32 - (len+offset) ) ;
  int   left_shift  = offset ;
  DWORD right_mask  = 0xFFFFFFFF >> right_shift  ;
  DWORD left_mask   = 0xFFFFFFFF << left_shift   ;
  DWORD mask        = left_mask & right_mask     ;

  *p |= mask ;

  return ;
}

clearbits( DWORD *p , int offset , int len )
{
  // offset must be 0-31, len must be 0-31, len+offset must be 0-32
  int   right_shift = ( !len ? 0 : 32 - (len+offset) ) ;
  int   left_shift  = offset ;
  DWORD right_mask  = 0xFFFFFFFF >> right_shift   ;
  DWORD left_mask   = 0xFFFFFFFF << left_shift    ;
  DWORD mask        = ~( left_mask & right_mask ) ;

  *p &= mask ;

  return ;
}

I stumbled across this improved version whilst looking for something else today. Courtesy of Sean Anderson's Bit Twiddling Hacks at Stanford University:

// uncomment #define to get the super scalar CPU version.
// #define SUPER_SCALAR_CPU
void setbits( unsigned int *p , int offset , int len , int flag )
{
  unsigned int mask = ( ( 1 << len ) - 1 ) << offset ;

#if !defined( SUPER_SCALAR_CPU )
  *p ^= ( - flag ^ *p ) & mask ;
#else
  // supposed to be some 16% faster on a Intel Core 2 Duo than the non-super-scalar version above
  *p = (*p & ~ mask ) | ( - flag & mask ) ;
#endif

  return ;

}

Much depends on your compiler, though.

Nicholas Carey
  • 71,308
  • 16
  • 93
  • 135
  • yeah, I like this... I refer to the bits in the same order as you, but in the use of this function, it is filling bits in a bit mask by coordinate (x ascends going right) – oldSkool Jan 26 '11 at 01:19
  • It's very nice, but depends on context. When `bool bSet` is a compile time constant, explicit despatching is the best way. – ruslik Jan 26 '11 at 03:08