415

Is there a faster way than x >= start && x <= end in C or C++ to test if an integer is between two integers?

UPDATE: My specific platform is iOS. This is part of a box blur function that restricts pixels to a circle in a given square.

UPDATE: After trying the accepted answer, I got an order of magnitude speedup on the one line of code over doing it the normal x >= start && x <= end way.

UPDATE: Here is the after and before code with assembler from Xcode:

NEW WAY

// diff = (end - start) + 1
#define POINT_IN_RANGE_AND_INCREMENT(p, range) ((p++ - range.start) < range.diff)

Ltmp1313:
 ldr    r0, [sp, #176] @ 4-byte Reload
 ldr    r1, [sp, #164] @ 4-byte Reload
 ldr    r0, [r0]
 ldr    r1, [r1]
 sub.w  r0, r9, r0
 cmp    r0, r1
 blo    LBB44_30

OLD WAY

#define POINT_IN_RANGE_AND_INCREMENT(p, range) (p <= range.end && p++ >= range.start)

Ltmp1301:
 ldr    r1, [sp, #172] @ 4-byte Reload
 ldr    r1, [r1]
 cmp    r0, r1
 bls    LBB44_32
 mov    r6, r0
 b      LBB44_33
LBB44_32:
 ldr    r1, [sp, #188] @ 4-byte Reload
 adds   r6, r0, #1
Ltmp1302:
 ldr    r1, [r1]
 cmp    r0, r1
 bhs    LBB44_36

Pretty amazing how reducing or eliminating branching can provide such a dramatic speedup.

jjxtra
  • 30
    Why are you concerned that this isn't fast enough for you? – Matt Ball Jun 13 '13 at 19:22
  • 10
    Is this particular test the bottleneck in your application? – cdhowie Jun 13 '13 at 19:23
  • 14
    Don't worry about it. The optimizer is extremely good. – SLaks Jun 13 '13 at 19:23
  • 101
    Who cares why, it's an interesting question. It's just a challenge for the sake of a challenge. – David says Reinstate Monica Jun 13 '13 at 19:23
  • 14
    @Dgrin91: Not really. It depends on exactly which compiler, optimizer, platform, data type, and who knows what else. – SLaks Jun 13 '13 at 19:24
  • 47
    @SLaks So we should just ignore all such questions blindly and just say "let the optimizer do it?" – David says Reinstate Monica Jun 13 '13 at 19:27
  • 91
    it doesn't matter why the question is being asked. It's a valid question, even if the answer is *no* – tay10r Jun 13 '13 at 19:28
  • 7
    I would suggest this question is meaningless *in `c`*. If you asked "in assembly on a very specific platform", there might be a reasonable answer. So as asked, it is not a valid question, even an academic one. – BoBTFish Jun 13 '13 at 19:31
  • 43
    This is a bottleneck in a function in one of my apps – jjxtra Jun 13 '13 at 19:32
  • start, end and value are all between 0 and n, where n is usually less than 128 – jjxtra Jun 13 '13 at 19:33
  • 7
    @Dgrin91: No; we should ask such questioners to provide more detail. – SLaks Jun 13 '13 at 19:46
  • 4
    did you try the non-lazy `x >= start & x <= end` (to avoid the extra branch)? – ratchet freak Jun 14 '13 at 01:48
  • 16
    @SLaks regarding _we should ask such questioners to provide more detail_: That's not what you did. Your comment was _Don't worry about it. The optimizer is extremely good._ – jogojapan Jun 14 '13 at 02:35
  • "This is part of a box blur function" Doesn't iphone have shaders? – SigTerm Jun 14 '13 at 02:44
  • 1
    @SigTerm yes it has shaders but my app is using quartz2d and cgbitmap context for everything. – jjxtra Jun 14 '13 at 14:08
  • 3
    I think there was a bug in your original code. It would not increment p if the first comparison was false. The new code always increments p. This may explain the bulk of your speed up. – jxh Jun 14 '13 at 16:24
  • 3
    You were skating on thin ice with your old way, since the increment wasn't guaranteed to happen in the sequence you might assume. For that matter I don't understand why it's there at all, since you seem to be incrementing the value and not a pointer. – Mark Ransom Jun 14 '13 at 16:25
  • @jxh That is intentional. I did not want p to increment in the original code to save a few cycles. Once it got past the right edge or bottom edge of the circle the condition would return false without doing the second comparison. Visually, the blur looks the same with both methods. – jjxtra Jun 14 '13 at 16:26
  • 5
    I wouldn't recommend the usage of `p++` rather than just `p` inside the macro anyway. It may make your code slightly shorter as you don't have to increment `p` after usage of the macro, but by including it in the macro you're possibly violating an assumption that most programmers probably make (that a bounds check will not modify the values involved in the check). And if a difference of a few cycles per macro use on average is enough to make a big perf difference, then you may want to look into reducing the number of times the macro is used. Micro-opts sometimes ignore the actual perf issues. – JAB Jun 14 '13 at 16:28
  • @JAB Good point, I will rename the macro to indicate that there is an increment going on – jjxtra Jun 14 '13 at 16:29
  • I take it back, I didn't realize that `&&` defined a sequence point. See http://stackoverflow.com/questions/4176328/undefined-behavior-and-sequence-points. And I imagine you call the macro with a pointer reference i.e. `*p`? That would work then, but it's not the easiest code to follow. – Mark Ransom Jun 14 '13 at 16:30
  • @MarkRansom Correct. Yes it's not the easiest code to follow but it is a pointer reference. – jjxtra Jun 14 '13 at 16:33
  • @PsychoDad Oh wait, if it's a pointer reference and based on the usage (should've noticed it by the use of dot notation next to the incrementing), is `p` an iterator? I guess I can see how that could provide a performance impact if the overloaded `++` in the iterator doesn't get inlined/optimized. – JAB Jun 14 '13 at 16:36
  • @JAB It's not an iterator, it's just a pointer to a struct. – jjxtra Jun 14 '13 at 16:37
  • 1
    FYI you should turn those `#define`'s into inline functions *(it will not hurt performance at all)*. See [here](http://stackoverflow.com/a/3554543/238419) for the reason why. – BlueRaja - Danny Pflughoeft Jun 14 '13 at 16:42
  • @PsychoDad It's true that I don't have all that much experience with C++, but as far as I can tell a pointer-to-struct (or to class instance) still needs `->` to access its members, just as in C. I don't believe it would be necessary if it were a reference to the struct, but that would not allow the incrementing without an overload for the `++` operator. – JAB Jun 14 '13 at 16:48
  • What about optimizing the data loading? Did you try using smaller data types, packing one value into the offset bits of the other register? When you have 2 values in one 32-bit data type, you can still load it with a single `ldr` and then compare against the offsets. – bartimar Jun 14 '13 at 17:40
  • 24
    You really should have posted your **real** code much sooner. The `++` makes a **big** difference... – Oliver Charlesworth Jun 14 '13 at 18:00
  • For the curious, this function was used in the blur tool in the app I wrote, You Doodle for iOS - http://bit.ly/YouDoodleApp – jjxtra Jan 12 '15 at 02:48
  • 1
    Those 2 versions aren't equivalent! The `&&` one only increments `p` if the first compare is false; the unsigned-compare trick version increments unconditionally. That's not a drop-in replacement. As OliverCharlesworth commented, that's why the compiler couldn't do the optimization itself. (Although older GCC did sometimes miss that optimization for runtime variables, but that's fixed now. Without the side effect, both ways should compile the same.) – Peter Cordes Dec 21 '20 at 19:17

7 Answers

564

There's an old trick to do this with only one comparison/branch. Whether it'll really improve speed may be open to question, and even if it does, it's probably too little to notice or care about, but when you're only starting with two comparisons, the chances of a huge improvement are pretty remote. The code looks like:

// use a < for an inclusive lower bound and exclusive upper bound
// use <= for an inclusive lower bound and inclusive upper bound
// alternatively, if the upper bound is inclusive and you can pre-calculate
//  upper-lower, simply add 1 to upper-lower and use the < operator.
    if ((unsigned)(number-lower) <= (upper-lower))
        in_range(number);

With a typical, modern computer (i.e., anything using two's complement), the conversion to unsigned is really a no-op -- just a change in how the same bits are viewed.

Note that in a typical case, you can pre-compute upper-lower outside a (presumed) loop, so that doesn't normally contribute any significant time. Along with reducing the number of branch instructions, this also (generally) improves branch prediction. In this case, the same branch is taken whether the number is below the bottom end or above the top end of the range.

As to how this works, the basic idea is pretty simple: a negative number, when viewed as an unsigned number, will be larger than anything that started out as a positive number.

In practice, this method translates number and the interval to the origin and checks whether number lies in the interval [0, D], where D = upper - lower. If number is below the lower bound, the difference is negative (and therefore huge when viewed as unsigned); if it is above the upper bound, the difference is larger than D.
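As a concrete, standalone illustration (my own sketch of the answer's check; the `between` wrapper and the sample values are assumptions, not part of the original answer):

#include <cstdio>

// The single-comparison check from the answer, with inclusive bounds.
static bool between(int number, int lower, int upper)
{
    // When number < lower, the subtraction wraps to a huge unsigned value,
    // so one unsigned compare rejects values on either side of the range.
    return (unsigned)(number - lower) <= (unsigned)(upper - lower);
}

int main()
{
    std::printf("%d\n", between(15, 10, 20)); // 1: inside
    std::printf("%d\n", between(10, 10, 20)); // 1: at the lower bound
    std::printf("%d\n", between(5,  10, 20)); // 0: 5 - 10 wraps to 4294967291
    std::printf("%d\n", between(25, 10, 20)); // 0: 15 > 10
}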

Jerry Coffin
  • 3
    @OliCharlesworth: Yes, but he's said both are greater than 0, so it can't overflow (i.e., upper-lower < upper). – Jerry Coffin Jun 13 '13 at 19:38
  • 1
    Nice, but Im curious now. Thinking in machine cycles, a subtractor isn't more expensive than a comparison? – Amadeus Jun 13 '13 at 19:38
  • 2
    @JerryCoffin: Ah, I hadn't noticed that! – Oliver Charlesworth Jun 13 '13 at 19:38
  • 9
    @TomásBadan: They'll both be one cycle on any reasonable machine. What's expensive is the branch. – Oliver Charlesworth Jun 13 '13 at 19:39
  • upper-lower can be pre-calculated in my case as well since my data structure contains a start and end value, nice! – jjxtra Jun 13 '13 at 19:41
  • 1
    @AK4749: Which is why questions that produce nuggets like these should also be rewarded with an upvote. – jxh Jun 13 '13 at 19:43
  • 3
    Additional branching is done due to short-circuiting? If this is the case, would `lower <= x & x <= upper` (instead of `lower <= x && x <= upper`) result in better performance as well? – Markus Mayr Jun 13 '13 at 19:45
  • 6
    @AK4749, jxh: As cool as this nugget is, I'm hesitant to upvote, because there's unfortunately nothing to suggest this is any faster in practice (until someone does a comparison of resulting assembler and profiling info). For all we know, the OP's compiler may render the OP's code with a single branch opcode... – Oliver Charlesworth Jun 13 '13 at 19:50
  • 1
    @MarkusMayr, an optimizer might make the substitution for you using the [as-if rule](http://en.cppreference.com/w/cpp/language/as_if) since comparing integers has no side effects. – Mark Ransom Jun 13 '13 at 19:54
  • 163
    WOW!!! This resulted in an order of magnitude improvement in my app for this specific line of code. By precomputing upper-lower my profiling went from 25% time of this function to less than 2%! Bottleneck is now addition and subtraction operations, but I think it might be good enough now :) – jjxtra Jun 13 '13 at 19:54
  • 2
    @OliCharlesworth, even if testing shows that it's no different or even worse on most processors and compilers, if there's one where it's better then it's a worthwhile answer. – Mark Ransom Jun 13 '13 at 19:55
  • @MarkRansom I would expect any compiler to optimize the code in that way. But if this is the case, then I can't see any reason why the code in the answer should be substantially faster than the code provided by the OP. – Markus Mayr Jun 13 '13 at 19:55
  • 2
    @PsychoDad: To satisfy the curious amongst us, would you mind posting the assembler your compiler generated in each case? – Oliver Charlesworth Jun 13 '13 at 19:57
  • @PsychoDad typically compilers support the `-S` option to produce an assembly file. I'd actually be interested in the assembly of the original (slow) code as well. – Bryan Olivier Jun 13 '13 at 20:08
  • Found it in Xcode, since the file it's in is over 1000 lines now the tricky part will be finding the exact piece :) – jjxtra Jun 13 '13 at 20:08
  • 3
    I *think* this is the optimized code but I could be wrong, someone tell me if this looks way off: Ltmp1313: ldr r0, [sp, #176] @ 4-byte Reload ldr r1, [sp, #164] @ 4-byte Reload ldr r0, [r0] ldr r1, [r1] sub.w r0, r9, r0 cmp r0, r1 blo LBB44_30 – jjxtra Jun 13 '13 at 20:15
  • 1
    @PsychoDad Looks like it, except `number` and `upper-lower` seem to come from a `struct` (or global). – Bryan Olivier Jun 13 '13 at 20:33
  • Yes they come from a struct with start, end and diff properties – jjxtra Jun 13 '13 at 20:43
  • 2
    I think this is the slower version: Ltmp1301: ldr r1, [sp, #172] @ 4-byte Reload ldr r1, [r1] cmp r0, r1 bls LBB44_32 mov r6, r0 b LBB44_33 LBB44_32: ldr r1, [sp, #188] @ 4-byte Reload adds r6, r0, #1 Ltmp1302: ldr r1, [r1] cmp r0, r1 bhs LBB44_36 – jjxtra Jun 13 '13 at 20:52
  • For the curious the algorithm involved is a box blur that is restricting the blurred pixels to a circle. The inclusive check is checking if the current pixel is a point in a circle. – jjxtra Jun 13 '13 at 20:55
  • @OliCharlesworth absolutely. I realize that the optimizer may end up producing comparable code, so I do see your point to an extent. I also see, however, that Jerry has some theoretical backing to his answer, which makes it (IMO) a good answer. I can see how it's not an obvious "good SO answer" though. I'm so conflicted – im so confused Jun 13 '13 at 21:11
  • 3
    [A pastebin](http://pastebin.com/JG4mNVyR) of the assembler in a hopefully more readable format. – jxh Jun 13 '13 at 22:15
  • 2
    @MarkusMayr Checking with gcc explorer, using `&&` and using `&` produce exactly the same code, which is the short circuit (`&&`) method. – SirGuy Jun 13 '13 at 23:47
  • @BlueRaja-DannyPflughoeft the entire app is done with core graphics so at this point an OpenGL rewrite would be somewhat painful. Performance is actually decent even on my iPhone 4S with a 2048x2048 image. – jjxtra Jun 14 '13 at 00:04
  • 1
    @PsychoDad YOu really ought to put these details in your question (or edit them into the accepted answer as appropriate). Comments are a bad place for good information like this. The information gets lost, and sooner or later, it's just hard for other people to follow. – George Stocker Jun 14 '13 at 02:19
  • 32
    Ah, now the @PsychoDad has updated the question, it's clear why this is faster. The **real** code has a side-effect in the comparison, which is why the compiler couldn't optimize the short-circuit away. – Oliver Charlesworth Jun 14 '13 at 17:57
  • I wonder if it is possible to apply this trick in Java, because Java has no unsigned integers. – Konstantin Milyutin Nov 11 '13 at 13:36
  • Do you happen to have a reference for the 'old trick' that I could cite? – Patrick Sanan Nov 14 '13 at 07:58
  • @TheNobleSunfish: Sorry, but no, not really. – Jerry Coffin Nov 14 '13 at 14:38
  • 2
    @PatrickSanan: a bit late, I know, but I think this particular trick is covered in detail in ["Hacker's Delight" by Henry S. Warren](https://www.amazon.co.uk/Hackers-Delight-Henry-S-Warren/dp/0321842685) (along with a lot of other cool low-level hacks). – Paul R Nov 18 '16 at 14:30
  • Isn't this undefined behaviour if `number - lower > INT_MAX`? – Hugh Jun 12 '21 at 03:38
  • 1
    GCC generates the same code for naive vs optimized: https://godbolt.org/z/zqhMs8Yzn – PBS Sep 25 '22 at 01:57
23

It's rare to be able to do significant optimizations to code on such a small scale. Big performance gains come from observing and modifying the code from a higher level. You may be able to eliminate the need for the range test altogether, or only do O(n) of them instead of O(n^2). You may be able to re-order the tests so that one side of the inequality is always implied. Even if the algorithm is ideal, gains are more likely to come when you see how this code does the range test 10 million times and you find a way to batch them up and use SSE to do many tests in parallel.
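To make the last point concrete, here is a rough sketch (my own, not from the answer) of batching four range tests per iteration with SSE2 intrinsics; the function name and mask convention are assumptions:

#include <emmintrin.h>  // SSE2

// Returns a 4-bit mask; bit i is set when lane i of v lies in [lower, upper]
// (inclusive). Uses the same unsigned-subtract trick as the accepted answer;
// since SSE2 only has signed 32-bit compares, both sides are biased by
// 0x80000000 to turn the unsigned compare into a signed one.
static int range_mask4(__m128i v, int lower, int upper)
{
    const __m128i bias = _mm_set1_epi32((int)0x80000000);
    __m128i x = _mm_xor_si128(_mm_sub_epi32(v, _mm_set1_epi32(lower)), bias);
    __m128i d = _mm_xor_si128(_mm_set1_epi32(upper - lower), bias);
    __m128i gt = _mm_cmpgt_epi32(x, d);         // lanes that are out of range
    return ~_mm_movemask_ps(_mm_castsi128_ps(gt)) & 0xF;
}

(On the OP's ARM/iOS target the equivalent would be NEON intrinsics, but the idea carries over.)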

Ben Jackson
  • 20
    Despite the downvotes I stand by my answer: The generated assembly (see the pastebin link in a comment to the accepted answer) is pretty terrible for something in the inner loop of a pixel processing function. The accepted answer is a neat trick but its dramatic effect is far beyond what is reasonable to expect for eliminating a fraction of a branch per iteration. Some secondary effect is dominating, and I still expect that an attempt to optimize the whole process over this one test would leave the gains of clever range comparison in the dust. – Ben Jackson Jun 14 '13 at 07:58
17

It depends on how many times you want to perform the test over the same data.

If you are performing the test a single time, there probably isn't a meaningful way to speed up the algorithm.

If you are doing this for a small, finite set of values, then you could create a lookup table. Performing the indexing might be more expensive, but if you can fit the entire table in cache, then you can remove all branching from the code, which should speed things up.

For your data the lookup table would have 128^3 = 2,097,152 entries. If you can control one of the three variables so that you consider all instances where start = N at one time, then the size of the working set drops to 128^2 = 16,384 bytes, which should fit well in most modern caches.

You would still have to benchmark the actual code to see if a branchless lookup table is sufficiently faster than the obvious comparisons.
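A minimal sketch of what the per-start slice of that table could look like (the names build_table and N, and the flat layout, are mine, not the answer's):

#include <cstdint>
#include <vector>

constexpr int N = 128;  // start, end, and x are all known to be in [0, N)

// table[end * N + x] == 1 exactly when start <= x && x <= end, for the
// start value chosen at build time. N * N = 16,384 bytes, cache-friendly.
std::vector<std::uint8_t> build_table(int start)
{
    std::vector<std::uint8_t> table(N * N);
    for (int end = 0; end < N; ++end)
        for (int x = 0; x < N; ++x)
            table[end * N + x] = (x >= start && x <= end) ? 1 : 0;
    return table;
}

// Usage: auto t = build_table(start); ... if (t[end * N + x]) { /* in range */ }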

Andrew Prock
  • So you would store some sort of lookup given a value, start and end and it would contain a BOOL telling you if it was in between? – jjxtra Jun 13 '13 at 19:33
  • Correct. It would be a 3D lookup table: `bool between[start][end][x]`. If you know what your access pattern is going to look like (for example x is monotonically increasing) you can design the table to preserve locality even if the entire table doesn't fit in memory. – Andrew Prock Jun 13 '13 at 19:36
  • I'll see if I can get around to trying this method and seeing how it goes. I'm planning on doing it with a bit vector per line where the bit will be set if the point is in the circle. Think that will be faster than a byte or int32 vs the bit masking? – jjxtra Jun 19 '13 at 18:50
8

For any variable range checking:

if (x >= minx && x <= maxx) ...

It is faster to use bit operation:

if ( ((x - minx) | (maxx - x)) >= 0) ...

This will reduce two branches into one: the bitwise OR has its sign bit set exactly when at least one of the two differences is negative, i.e. when x lies outside [minx, maxx].

If you care about type safety:

if ((int32_t)(((uint32_t)x - (uint32_t)minx) | ((uint32_t)maxx - (uint32_t)x)) >= 0) ...

You can combine more variable range checking together:

if (( (x - minx) | (maxx - x) | (y - miny) | (maxy - y) ) >= 0) ...

This will reduce 4 branches into 1.

It is 3.4 times faster than the old one in gcc.

[benchmark screenshot omitted]
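As a usage sketch (hypothetical helper name, simply mirroring the combined check above):

// Hypothetical wrapper: true when (x, y) lies inside the inclusive
// rectangle [minx, maxx] x [miny, maxy], using a single branch.
static bool point_in_rect(int x, int y, int minx, int maxx, int miny, int maxy)
{
    return ((x - minx) | (maxx - x) | (y - miny) | (maxy - y)) >= 0;
}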

skywind3000
  • Is the reason this works that bitwise OR-ing a negative and a positive will always produce a negative? I ask because it's a bit out of my weight class, but I still wanna understand it. – netsplit Feb 12 '23 at 17:46
4

This answer is to report on testing done with the accepted answer. I performed a closed-range test on a large vector of sorted random integers, and to my surprise the basic method (low <= num && num <= high) is in fact faster than the accepted answer above! The test was done on an HP Pavilion g6 (AMD A6-3400 APU with 6 GB RAM). Here's the core code used for testing:

// randVec is a sorted vector<int> with MaxNum elements (defined elsewhere)
int num = rand();  // num to compare in consecutive ranges.
auto start = chrono::system_clock::now();

int inBetween1{ 0 };
for (int i = 1; i < MaxNum; ++i)
{
    if (randVec[i - 1] <= num && num <= randVec[i])
        ++inBetween1;
}
auto end = chrono::system_clock::now();
chrono::duration<double> elapsed_s1 = end - start;

compared with the following which is the accepted answer above:

int inBetween2{ 0 };
for (int i = 1; i < MaxNum; ++i)
{
    if (static_cast<unsigned>(num - randVec[i - 1]) <= (randVec[i] - randVec[i - 1]))
        ++inBetween2;
}

Note that randVec is a sorted vector. For any size of MaxNum, the first method beats the second one on my machine!

rezeli
  • 1
    My data is not sorted and my tests are on iPhone arm CPU. Your results with different data and CPU may differ. – jjxtra Feb 03 '17 at 14:24
  • sorted in my test was only to make sure upper limit is not smaller than lower limit. – rezeli Feb 14 '17 at 23:53
  • 2
    Sorted numbers mean branch prediction will be very reliable and get all branches right except for a few at the switchover points. The advantage of branchless code is that it will get rid of these kinds of misspredictions on unpredictable data. – Andreas Klebinger Feb 07 '19 at 00:07
0

I can tell you exactly why this would matter. Imagine you're simulating an MMU. You constantly have to check whether a given memory address is valid within a given page set. Those little bits add up very quickly, because for every access you're asking (see the sketch after this list):

  • Is this address valid?
  • What page is this address part of?
  • What rights does this page have?
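A minimal sketch of that kind of hot path (the Page struct and find_page are hypothetical, for illustration only):

#include <cstdint>

struct Page { std::uint32_t base, size, rights; };

// Returns the page containing addr, or nullptr. Since addr and base are
// unsigned, addr - base wraps to a huge value whenever addr < base, so the
// two-sided test "base <= addr < base + size" collapses into one compare.
static const Page* find_page(const Page* pages, int count, std::uint32_t addr)
{
    for (int i = 0; i < count; ++i)
        if (addr - pages[i].base < pages[i].size)
            return &pages[i];
    return nullptr;
}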
user500123
-4

Is it not possible to just perform a bitwise operation on the integer?

Since it has to be between 0 and 128, if the 8th bit (2^7) is set the value is 128 or more. The edge case will be a pain, though, since you want an inclusive comparison.
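A sketch of the test this hints at (function name assumed; it only works here because the lower bound is 0):

// For the fixed range [0, 128]: (x & ~127u) == 0 holds exactly when
// 0 <= x <= 127, and the inclusive edge case x == 128 is checked separately.
static bool in_0_to_128(unsigned x)
{
    return (x & ~127u) == 0 || x == 128;
}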

icedwater