7

In the binary search algorithm, we set the mid as:

mid = (start + end)/2 which is same as

mid = start/2 + end/2 and also equal to

mid = start + (end - start)/2

but all of the three give different results being the same arithmetic expression. How do these vary during the computation process?

This was the code to find the last occurrence of an element in a vector array using binary search:

int lastOccurrence(vector<int> arr, int size, int key){
    int start = 0, end = size - 1;
    int last = -1;
    // int mid = start/2 + end/2;
    int mid;
    while(start <= end){
        // mid = start + (end - start)/2;
        mid = (start + end)/2;
        if(key == arr[mid]){
            last = mid;
            start = mid + 1;
        }
        else if(key > arr[mid]){
            start = mid + 1;
        }
        else{
            end = mid - 1;
        }
        cout << "start: " << start << "\tend: " << end << "\tmid: " << mid << endl;
    }
    return last;
}

The values being passed to the function are:

int main(){
    vector<int> arr = {1,2,3,4,4,4,4,5,6,7,11};
    int size = arr.size();
    int key = 4;
    cout << "First occurrence of " << key << " is at index " << firstOccurrence(arr, size, key) << endl;

    cout << "Last occurrence of " << key << " is at index " << lastOccurrence(arr, size, key) << endl;

    return 0;
}

If the mid element equals the required "key" element then the mid index is stored in a variable and start is updated to mid + 1 so that it can search the right part of the array for any other occurrence of the "key". If the "key" is found to be less than the mid element it implies that the element is not present beyond the mid element and the end is updated to mid - 1 to search in the left part of the array, and similarly to search the right part if "key" is found to be greater than the mid element.

It gave different results when mid = start/2 + end/2 was used and mid = (start + end)/2 was used. How is this affected during the computation process?

Vlad from Moscow
  • 301,070
  • 26
  • 186
  • 335
user19117411
  • 105
  • 5
  • 1
    What is the input to the function? Have you tried to step through the code in a [debugger](https://stackoverflow.com/questions/25385173/what-is-a-debugger-and-how-can-it-help-me-diagnose-problems) while monitoring variables and their values, to see what really happens? And you *do* remember that an absolute requirement for binary search to work is that the data you search through is *ordered*? – Some programmer dude Jul 10 '23 at 11:30
  • 2
    Not going to re-calculate each of these, but it boils down to your values being integers. Integer division will cut off any decimal places. – nick Jul 10 '23 at 11:31
  • 13
    `(1+3)/2` is not the same as `1/2+3/2`. – Daniel Langr Jul 10 '23 at 11:31
  • 1
    `mid = start + (end - start)/2` is the same as `2*mid = 2*start + (end - start)` is the same as `2*mid = 2*start + end - start` is the same as `2*mid = start + end` is the same as `mid = (start + end)/2`. Please show your values. (this all assuming `start+end` doesn't overflow) – 500 - Internal Server Error Jul 10 '23 at 11:32
  • 1
    they are not mathematically equivalent – Pepijn Kramer Jul 10 '23 at 11:36
  • 1
    Each operation you describe involves operands of type `int`, so the result is an `int`. So `start/2` rounds towards zero (e.g. `3/2` gives `1`, not `1.5`), as does `end/2`. As an example, `1/2 + 3/2` computes `1/2` and `3/2` separately, producing `0` and `1` respectively, and their sum is `1`. Whereas `(1+3)/2` computes `1+3` first, giving `4`, and divides that by `2`, producing `2`. – Peter Jul 10 '23 at 11:39

4 Answers4

6

For starters the function is invalid. When size is equal to 1 then you have due to this statement

int start = 0, end = size - 1;

that end is equal to 0.

In this case the while loop

while(start < end){

will be skipped. And the function will return the value of last equal to -1

int last = -1;

// ...

return last;

though arr[0] can be equal to key.

As for your question then when start and end are both odd values then the value of mid will be one less for the expression

start/2 + end/2

then for other two expressions.

As for this expression (start + end)/2 then it is unsafe due to a possible overflow of the sum start + end.

Pay attention to that in C++20 there is function std::midpoint declared in header <numeric> that can be and should be used instead of manually written expressions.

As for the function as whole then there is already standard algorithm std::upper_bound declared in header <algorithm> that can be adapted for using instead of the function.

Vlad from Moscow
  • 301,070
  • 26
  • 186
  • 335
4

You need to consider that integer arithmetic cuts off any fractional parts, hence depending on the last bit of start and stop you get different results.

Suppose they are

 start = M*2 + a;
 end = N*2 + b;

Where M and N are integers and a and b are either 1 or 0, then you get

mid_0 = (start + end)/2 = M+N + (a+b) / 2
mid_1 = start/2 + end/2 = M+N
mid = start + (end - start)/2 = M*2 + a + (N-M) + (b-a)/2 = M+N + a + (b-a)/2 

Only the second expression does not depend on whether start or end are even or odd. I didn't actually bother to work out (via a 2x2 table) whether a + (b-a)/2 yields the same result as (a+b)/2. However, when dealing with integer arithmetic you better do not rely on intuition, because it's too easy to be off by one (or more). Moreover, I did not yet consider integer overflow. When start+end overflows then (start/2) + (end/2) does not.

463035818_is_not_an_ai
  • 109,796
  • 11
  • 89
  • 185
  • It means that while doing it on paper, all of the three expressions turn out to be the same expression as start/2 + end/2 but affect the solution for intergers and float during programming. – user19117411 Jul 10 '23 at 12:23
  • 1
    @user19117411 no, also on paper `5 / 2 + 5/2` is `2 + 2 = 4` is not the same as `(5+5)/10`. Its integer arithmetics. You can do it on paper. Its just not the "normal" arithmetics you are used to (where `(a+b)/2 == a/2 + b/2`) – 463035818_is_not_an_ai Jul 10 '23 at 12:28
  • It is not mathematics with real numbers or even natural numbers `x ∉ ℝ ∧ x ∉ ℕ`, replace `x` with any of your variables `start`, `end`, `mid`. So the normal maths does not work. Your `x` are `int`, so you are dealing with bits and bytes and unsigned bit, register overflow, carry bit, when really dealing with integer mathematics at the core. – destructioneer Jul 16 '23 at 15:00
3

All of these are attempts to prevent overflow when calculating the midpoint:

(a + b) / 2

The best way (supposedly) is this:

a + b = (a ^ b) + (a & b) << 1
(a + b) / 2 = (a ^ b) / 2 + (a & b)

The identity comes from Don Knuth's book, The Art of Computer Programming, Vol. 4.

This is because Don's formula is supposedly bulletproof. It should work equally well on negative and positive indices. Note that shifting right (including arithmetically) is not always the same as dividing by 2.

user1095108
  • 14,119
  • 9
  • 58
  • 116
  • Casting to unsigned int works, too, if the code is using signed integers as array indices: `(a + b) / 2 = (int)(((uint)a + (uint)b) >> 2)` . – CookedCthulhu Jul 27 '23 at 15:06
  • @CookedCthulhu AFAIK, ((uint)a + (uint)b) can overflow and why shift by 2 places? Typo and you didn't understand the point. – user1095108 Jul 27 '23 at 18:26
  • The shift was a typo but the formula cannot overflow because positive signed int `a + b` is at most `2^32 - 2`, which is valid if you cast both a and b to unsigned first. Unsigned shift doesn't do sign extension, so `((uint)a + (uint)b) >> 1` is at most `2^31 - 1`. This formula was used in C#'s source code multiple times, so there is a chance that it's either faster or at least more readable than the supposedly best way. – CookedCthulhu Jul 28 '23 at 07:11
  • @CookedCthulhu not true, there is no sign bit in unsigned integers. – user1095108 Jul 28 '23 at 07:19
  • "Unsigned shift doesn't do sign extension (because there is no sign bit in unsigned integers)". Better? It's what I wrote. – CookedCthulhu Jul 28 '23 at 07:29
  • @CookedCthulhu you just dont understand. If there was a sign bit and it were clear, then what you wrote would be true, but as it is overflow can happen. – user1095108 Jul 28 '23 at 07:36
  • Show me in which case it overflows. https://source.dot.net/#System.Private.CoreLib/src/libraries/System.Private.CoreLib/src/System/SpanHelpers.BinarySearch.cs,37ba64987104e3dc,references – CookedCthulhu Jul 28 '23 at 07:40
  • I am completely uninterested in c#. – user1095108 Jul 28 '23 at 07:42
  • Then show me in C++ where it overflows. – CookedCthulhu Jul 28 '23 at 07:50
  • (~0u + 1u) >> 1; – user1095108 Jul 28 '23 at 09:11
  • `~0u` is not representable as a positive signed int. No addition of two valid signed array indices can represent a number that overflows as unsigned. If the index is/becomes invalid, the binary search is out of range and has to be stopped anyway. start/mid/end must all satisfy `0 <= n <= size`. – CookedCthulhu Jul 28 '23 at 09:37
2

As mentioned in the previous answer, (start + end)/2 handles residuals better. However, start/2 + end/2 is not susceptible to overflow while (start + end)/2 is.

So if you're handling arrays with potentially >2G elements, it is advised to either make start/end/mid 64b ints or prefer the start/2 + end/2 form.

Ofek Shilon
  • 14,734
  • 5
  • 67
  • 101