5

I happen to be building the binary search in Python, but the question has more to do with binary search structure in general.

Let's assume I have about one thousand eligible candidates I am searching through using binary search, doing the classic approach of bisecting the sorted dataset and repeating this process in order to narrow down the eligible set to iterate over. The candidates are just strings of names,(first-last format, eg "Peter Jackson") I initially sort the set alphabetically and then proceed with bisection using something like this:

hi = len(names)
lo = 0
while lo < hi:
  mid = (lo+hi)//2
  midval = names[mid].lower()
  if midval < query.lower():
    lo = mid+1
  elif midval > query.lower():
    hi=mid
  else:
    return midval
return None

This code adapted from here: https://stackoverflow.com/a/212413/215608

Here's the thing, the above procedure assumes a single exact match or no result at all. What if the query was merely for a "Peter", but there are several peters with differing last names? In order to return all the Peters, one would have to ensure that the bisected "bins" never got so small as to except eligible results. The bisection process would have to cease and cede to something like a regex/regular old string match in order to return all the Peters.

I'm not so much asking how to accomplish this as what this type of search is called... what is a binary search with a delimited criteria for "bin size" called? Something that conditionally bisects the dataset, and once the criteria is fulfilled, falls back to some other form of string matching in order to ensure that there can effectively be a ending wildcard on the query (so a search for a "Peter" will get "Peter Jacksons" and "Peter Edwards")

Hopefully I've been clear what I mean. I realize in the typical DB scenario the names might be separated, this is just intended as a proof of concept.

Community
  • 1
  • 1
DeaconDesperado
  • 9,977
  • 9
  • 47
  • 77
  • in the worst case it could be all peters, couldn't it? – kdubs Dec 17 '12 at 20:02
  • Indeed, in the worst scenario (or should I say the intended one?) all the Peters would be fetched. – DeaconDesperado Dec 17 '12 at 20:03
  • so it seems you'd have you'd have to bin according to what you might search by. I'm guessing you can do a binary till you found a match, and then do a linear search in both directions to find all the other matches. not sure if I'd call it bins, but you'll have your data organized into a binary tree, and linear. – kdubs Dec 17 '12 at 20:07
  • I suppose one could set a hard coded limit to the binary iterations, sort of like a "logarithmic time to a point" approach, and then just go linear from there. I am just wondering what approaches (if there are any "named" ones out there) one could take given the simple control flow above. At what point would it go linear, and based on what conditions (since those conditionals will never actually match, maybe a call to regex as well?) – DeaconDesperado Dec 17 '12 at 20:10
  • I *think* you need to go linear as soon as you find a match. Not sure what to call this. I'm guessing you don't want to make a "peter" bin and a "sally" bin, etc. Not sure where regex would help. Unless that's what you started the search with, but I think I'm missing your point. – kdubs Dec 17 '12 at 20:13
  • If you look at the IndexOf function in my answer to this question http://stackoverflow.com/questions/10606728/fastest-way-to-search-a-list-of-names-in-c-sharp you can see how it does a binary search that results in the index to the first match (if there are more than one). – hatchet - done with SOverflow Dec 17 '12 at 20:18

2 Answers2

2

I've not come across this type of two-stage search before, so don't know whether it has a well-known name. I can, however, propose a method for how it can be carried out.

Let say you've run the first stage and have found no match.

You can perform the second stage with a pair of binary searches and a special comparator. The binary searches would use the same principle as bisect_left and bisect_right. You won't be able to use those functions directly since you'll need a special comparator, but you can use them as the basis for your implementation.

Now to the comparator. When comparing the list element x against the search key k, the comparator would only use x[:len(k)] and ignore the rest of x. Thus when searching for "Peter", all Peters in the list would compare equal to the key. Consequently, bisect_left() to bisect_right() would give you the range containing all Peters in the list.

All of this can be done using O(log n) comparisons.

NPE
  • 486,780
  • 108
  • 951
  • 1,012
  • The first comparison should still just be a straight bisection, though, right? So find the first instance of Peter (or just the initial bisection) and then use the comparator to only match however many characters are relevant to the query... – DeaconDesperado Dec 17 '12 at 20:24
  • @DeaconDesperado: The first method looks for an exact match and returns at most one, whereas the second looks for a prefix match and returns a range. You can mix and match these properties as is appropriate for your application... – NPE Dec 17 '12 at 20:28
0

In your binary search you either hit an exact match OR an area where the match would be.
So in your case you need to get the upper and lower boundaries (hi lo as you call them) for the area that would include the Peter and return all the intermediate strings.

But if you aim to do something like show next words of a word you should look into Tries instead of BSTs

Cratylus
  • 52,998
  • 69
  • 209
  • 339
  • Thanks, I think I understand what you are saying... the hi and lo calculation would have to take into account whether or not the subset is entirely made up of good matches. And I get what you mean about the "next words"... already using suffix trees for that. – DeaconDesperado Dec 17 '12 at 20:17
  • `the hi and lo calculation would have to take into account whether or not the subset is entirely made up of good matches` It doesn't have to be so complicated since the elements are already sorted. So all you need is all the elements inside the `low`-`hi` – Cratylus Dec 17 '12 at 20:18