12

I found some interesting bit twiddling in "source\common\unicode\utf.h" file of ICU library (International Components for Unicode). The bit twiddling is intended for checking whether a number is in a particular range.

// Is a code point in a range of U+d800..U+dbff?
#define U_IS_LEAD(c) (((c)&0xfffffc00)==0xd800)

I have figured out the magic number (0xfffffc00) come from:

MagicNumber = 0xffffffff - (HighBound - LowBound)

However, I also found that the formula doesn't apply to every arbitrary range. Does somebody here know in what circumstance the formula works?

Is there another bit twiddling for checking whether a number is in particular range?

David Jones
  • 4,766
  • 3
  • 32
  • 45
Astaroth
  • 2,241
  • 18
  • 35
  • for arbitrary ranges you'll need a different method: [Fastest way to determine if an integer is between two integers (inclusive) with known sets of values](https://stackoverflow.com/q/17095324/995714) – phuclv May 25 '19 at 12:42

3 Answers3

12

For these tricks to apply, the numbers must have some common features in their binary representation.

0xD800 == 0b1101_1000_0000_0000
0xDBFF == 0b1101_1011_1111_1111

What this test really does is to mask out the lower ten bits. This is usually written as

onlyHighBits = x & ~0x03FF

After this operation ("and not") the lower ten bits of onlyHighBits are guaranteed to be zero. That means that if this number equals the lower range of the interval now, it has been somewhere in the interval before.

This trick works in all cases where the lower and the higher limit of the interval start with the same digits in binary, and at some point the lower limit has only zeroes while the higher limit has only ones. In your example this is at the tenth position from the right.

Roland Illig
  • 40,703
  • 10
  • 88
  • 121
  • Can you provide any references for "usually written as"? Personally I find `a &~ b` instead of `a & ~b` less intuitive and `a & b == c` more intuitive than `a & ~d == e` because there are fewer operations even if it is just my personal preference. – CB Bailey Jan 01 '11 at 13:03
  • 3
    Be aware that `a & b == c` doesn't mean what you probably think it means (it means `a & (b == c)`). `a &~ b` is lexically identical to `a & ~b`, and I agree that the latter is a better transcription of it, if only because that's the way it's usually done. – Barry Kelly Jan 01 '11 at 13:45
4

If you do not have 2^x boundaries type might use the following trick:

if x >= 0 and x < N you can check both by:

  if Longword( x ) < Longword( N ) then ...

This works due to the fact that negative numbers in signed numbers correspond to the largest numbers in unsigned datatypes.

You could extend this (when range checking is DISABLED) to:

  if Longword( x - A ) < Longword ( ( B - A ) ) then ...

Now you have both tests (range [ A, B >) in a SUB and a CMP plus a single Jcc, assuming (B - A ) is precalculated.

I only use these kind of optimizations when really needed; eg they tend to make your code less readable and it only shaves off a few clock cycles per test.

Note to C like language readers: Longword is Delphi's unsigned 32bit datatype.

Ritsaert Hornstra
  • 5,013
  • 1
  • 33
  • 51
3

The formula works whenever the range you are looking for starts at a multiple of a power of 2 (that is, 1 or more bits at the low end of the binary form of the number ends in 0) and the size of the range is 2^n-1 (that is, low&high == low and low|high == high).

Vatine
  • 20,782
  • 4
  • 54
  • 70
  • have you tested it? Supposing the number is `9` and the range is `8..8+(2^14-1)`, the formula doesn't apply to this case. – Astaroth Jan 01 '11 at 15:09
  • Well... The N needs to be no larger than the number of 0 at the end of the base number (so for 8, N could be in the range 1-3). I thought tha to be too obvious to mention. – Vatine Jan 02 '11 at 12:29