1

I am working on an assignment for my class and the goal of this function is to use Binary Sort on an array of a struct, and return the index of the very first location that a last name is found (even if there are multiple last names, just return the first one). My code works almost perfectly for what I am trying to do, but when I print the index, the output I am getting is 1 too many. For example, if I call my function like this with the string "Zulauf" as the last name:

cout << binaryFindFirstByLastName("Zulauf", person, total) << endl;

I get 99811 instead of its actual location of 99812 (this is obviously reading from a large file). Any help or general advice would be greatly appreciated, thank you!

int binaryFindFirstByLastName(const std::string& value, const Person* array, int size) {
int low = 0;
int high = size-1;
int mid = (low + high) / 2;
while (low + 1 != high) {
    mid = (low + high) / 2;
    if (array[mid].last < value) {
        low = mid;
    }
    else {
        high = mid;
    }
    mid = (low + high) / 2;
}
if (high > size || array[high].last != value) {
    return -1;
}
else return high;
}
  • once you find a match walk backwards until you don't have a match. The last match you visit is the first occurrence. – NathanOliver Apr 09 '18 at 03:12
  • 2
    On a piece of paper, write down an array with two elements, and then manually walk through this function, one line at a time, searching for the 2nd element. This is simple enough to be done by hand, and you will see exactly why you end up with the wrong answer. – Sam Varshavchik Apr 09 '18 at 03:12
  • You're kind of getting paranoid with the `mid = (low + high) / 2;`. Just once at the beginning of the loop body is enough. – eesiraed Apr 09 '18 at 03:27

2 Answers2

1

For completeness, in the real world, we'd use the ready-made library template function std::lower_bound:

c++11 version:

#include <algorithm>

struct Person
{
    std::string last;
};

struct last_is_less
{
    bool operator()(std::string const& l, Person const& r) const
    {
        return l < r.last;
    }

    bool operator()(Person const& l, std::string const& r) const
    {
        return l.last < r;
    }
};

int binaryFindFirstByLastName(const std::string& value, const Person* array, int size) {
    auto first = array;
    auto last = array + size;
    auto i = std::lower_bound(first, last, value, last_is_less());
    if (i == last || i->last != value)
        return -1;
    return int(std::distance(first, i));
}

c++14 version, using free functions:

bool last_name_is_less(std::string const& l, Person const& r)
{
    return l < r.last;
}

bool last_name_is_less(Person const& l, std::string const& r)
{
    return l.last < r;
}

// using lambda to aid in expressing semantic intent
//
int binaryFindFirstByLastName2(const std::string& value, const Person* array, int size) {

    auto first = array;
    auto last = array + size;

    auto to_index = [&](auto iter) 
    {
        if (iter == last || iter->last != value)
            return -1;
        return int(std::distance(first, iter));
    };

    return to_index(std::lower_bound(first, last, 
                                     value, 
                                     [](auto&& l, auto&& r) 
                                     { 
                                         return last_name_is_less(l, r); 
                                     }));
}
Richard Hodges
  • 68,278
  • 7
  • 90
  • 142
0

Let's just go through the algorithm step by step. I'm going to be using ints here just for simplicity.

First, we have the loop condition. We should use low < high since we want to narrow the search down to either one element which we can later check if it's the target outside of the loop, or in the case where low ends up being high + 1, a search miss.

Now, let's look at the three cases that can happen in the loop body. If our target value is greater than the current element, we want low = mid + 1, since the leftmost place the target could be is the next element to the right. Same thing if the target value is less than the current element. If the target value is equal to the current element, we want hi = mid since this element could be the one we're looking for, or it could be to the left.

Once the loop exits, there are two cases that we need to deal with. If low > high, we obviously have a search miss. Otherwise, we need to check the remaining element to see if it's equal to our target.

Putting this all together:

int binarySearch(const int &value, const int *a, const int &size)
{
    int low = 0, high = size - 1;
    //Note that these do have to be signed types, since high could be -1
    while(low < high)
    {
        int mid = low + (high - low) / 2;
        //a way to find the average without any chance of overflow
        if(value == a[mid])
        {
            high = mid;
        }
        else if(value < a[mid])
        {
            high = mid - 1;
        }
        else
        {
            low = mid + 1;
        }
    }
    return (low > high || a[low] != value) ? -1 : low;
}

There are other ways that this could be implemented, but I find this way to be the simplest.

eesiraed
  • 4,626
  • 4
  • 16
  • 34
  • Thank you so much for the explanation and the code! You are wonderful! This worked for me, is there an easy way to morph this function to find the last occurrence as well? After Googling a lot it seems like changing one line would typically work to find the last occurrence instead of the first, but I am not sure what to change to find that. – Scranton Branch Apr 09 '18 at 05:41
  • @ScrantonBranch In that case, if we find an equivalent element, we know that this could be the element we're looking for, or it could be to the right. So we just need to change `high = mid;` into `low = mid;`. However, the problem is that if we narrow down to two elements, `low` will actually be equal to `mid`, and we will get an infinite loop. This means we must round the average up instead of down. We don't want to mess with floats here, so the best way probably is to add 1 to `mid` if `low + high` is odd. There might be a better way though. Doing this will also allow `high` to be unsigned. – eesiraed Apr 09 '18 at 23:25
  • @ScrantonBranch If my answer solved your problem, please [accept](https://stackoverflow.com/help/someone-answers) it by clicking on the outline of a checkmark below the up/downvote arrows. It helps me a lot, and it shows that this question has been answered. You can also upvote this answer if you feel like it, but you're not obligated to do any of this. – eesiraed Apr 09 '18 at 23:29
  • @ScrantonBranch Also, just something I overlooked here, we probably shouldn't name something `array` since it might get confused with `std::array`. Although you shouldn't use `using namespace std;` anyways. I'll change it to `a`. – eesiraed Apr 09 '18 at 23:50
  • Thank you for both explanations (and the tips)! I've always used "using namespace std;" in my code mainly because that is the way my professor teaches C++, although he does say it is not the best practice (or it is debated if it is good practice or not), but that is why I use that. As for naming my array "array", I didn't think too hard about that and just typed that out, but I see why that also is not very good practice. Thank you! I accepted your answer and upvoted you, but since I am new, my upvote won't show (I don't know too much about stackoverflow, so thank you again). – Scranton Branch Apr 10 '18 at 23:02
  • @ScrantonBranch For more info on that, see https://stackoverflow.com/q/1452721/9254539. – eesiraed Apr 10 '18 at 23:09