3

Reading this post about BK Trees, I found the following snippet a bit confusing:

"Assume for a moment we have two parameters, query, the string we are using in our search, and n the maximum distance a string can be from query and still be returned. Say we take an arbitary string, test and compare it to query. Call the resultant distance d. Because we know the triangle inequality holds, all our results must have at most distance d+n and at least distance d-n from test."

I can intuitively see that if something is d away from the word I'm searching with and I have a tolerance of n error, then I will need at least d-n distance from the word I'm at to "reverse" the differences. Similarly I can have at most d+n because after "reversing" the differences, I can introduce n more differences.

I'm confused how the triangle inequality was used to get this. If we let d(test, query) = d and d(query, found) <= n then by the inequality:

d(test, query) + d(test, nextWordToSearch) >= d(query, found)
d + d(test, nextWordToSearch) >= n

How can we find

d - n <= d(test, nextWordToSearch) <= d + n
xpt
  • 20,363
  • 37
  • 127
  • 216
Duck
  • 178
  • 1
  • 11

3 Answers3

3

Using @templatetypedef's answer, I was able to use the Triangle Inequality for finding both the upper and lower bound.

d(query, desiredNode) = n
d(query, test) = d1

d(query, test) + d(test, desiredNode) >= d(query, desiredNode)
d1 + d(test, desiredNode) >= n
d(test, desiredNode) >= |n - d1|

d(test, query) + d(query, desiredNode) >= d(test, desiredNode)
|d1 + n| >= d(test, desiredNode)

Hence:

|d1 + n| >= d(test, desiredNode) >= |d1 - n|

Absolute values used because of property of non-negative measure.

Duck
  • 178
  • 1
  • 11
1

In what follows, let d be the distance from the query word to the test word (the word at the current node) and n be the maximum distance you're willing to search. You're interested in proving that

n - d ≤ d(test, anyResultingWord) ≤ n + d.

The math you used in your question involving the triangle inequality is sufficient to establish the lower bound. I think the reason you're having trouble with the upper bound is that you don't actually want to use the triangle inequality here.

You actually don't need to use - and in fact, probably shouldn't! - use the triangle inequality to get the upper bound.

Remember that d(x, y) is defined as the Levenshtein distance between x and y, which is the minimum number of insertions, deletions, or substitutions necessary to turn x into y. We want to upper bound d(test, anyResultingWord) at n + d. To do that, notice the following. Starting with the test word, you can convert it to any resulting word as follows:

  • Start by converting it back to the query word, which takes d edits.
  • Convert the query word to the resulting word, which takes n edits.

Overall, this gives a series of n + d total edits needed to convert the test word to the result word. This might be the best way to do it, but it might not be. What we can say is that d(test, anyResultingWord) must be at most n + d, since we know we can convert the test to a resulting word in at most n + d edits. This is where the upper bound comes from - it's not a consequence of the triangle inequality as much as a consequence of how the distance metric is defined.

templatetypedef
  • 362,284
  • 104
  • 897
  • 1,065
1

First of all you have to understand that d obeys the triangle inequality. Let's prove this by contradiction:

Suppose for any 3 arbitrary strings a,b and c we have d(a,c)>d(a,b)+d(b,c), but in that case we can find d(a,c) with d(a,b)+d(b,c) steps, hence we have a contradiction. This is why d obeys the triangle inequality and d(a,c)<=d(a,b)+d(b,c).

Let's now imagine how search through that tree goes. We have a search function f that takes as input Q -- a query and N -- max distance.

Question: Why does that function need to look at edges that are in the segment [d-n,d+n]?

Let's now introduce a couple of other strings. Let x be a string, such that d(Q,x)<=n, let t be the current node we are examining. Clearly, in the above notation d meant d(Q,t).

So, to reformulate the above question, we can ask:

Why d(Q,t)-n<=d(t,x)<=d(Q,t)+n?

For simplicity, let's denote d(Q,t) as a, d(t,x) as b, and d(Q,x) as c.

It is clear from the triangle inequality that

  1. a+b>=c => b>=c-a
  2. a+c>=b
  3. b+c>=a => b>=a-c

From 1. and 3. we can see that b>=|a-c|. So, to put everything together, we get |a-c|<=b<=a+c.

Now, this is not the end of the proof, we also have something to do with 0<=c<=N.

This can be easily done like this: a-N<=a-c<=|a-c|<=b<=a+c<=a+N => a-N<=b<=a+N and since b>=0, we have max(a-N,0)<=b<=a+N.

Coder-Man
  • 2,391
  • 3
  • 11
  • 19