Second attempt at explanation:
Suppose that we're finding the distance between a length-m word and a length-n word. Let the matrix entries be indexed by [0, m] × [0, n], where the (i, j) entry represents the edit distance between the length-i prefix of the length-m word and the length-j prefix of the length-n word.
We can view the dynamic program as finding the shortest path from (0, 0) to (m, n) in a directed graph whose vertices correspond to matrix entries, with length-1 rightward arcs and length-1 downward arcs and length-0 or length-1 diagonal arcs depending on whether the characters at i and j match. The idea, in a nutshell, is to use A* with the length difference heuristic H(i, j) = |(m - i) - (n - j)|. Then, we don't expand nodes whose A* value is greater than d. The result is that only some of the diagonals need to be opened:
o t h e r w o r d
t * * *
h * * *
e * * *
w * * *
o * * *
r * * *
d * * *
First attempt at explanation:
Each matrix entry (i, j) is lowerbounded by |i - j|, because that's a lowerbound on the number of non-diagonal moves needed to reach that state. The strip is every element whose coordinates (i, j) satisfies |i - j| ≤ d, which looks like
o t h e r w o r d
t * * *
h * * * *
e * * * * *
w * * * * *
o * * * * *
r * * * * *
d * * * * *
for d = 2. When the values for the blank elements off of the strip are needed, just use d. At the end, any strip entry that's ≤ d is valid, because the blank elements can only contribute d + 1, seeing as the upper left neighbor of a strip element is also on the strip.
For words of different lengths, we can actually apply the argument to the transpose and restrict to the strip like
o t h e r w o r d
t * * *
h * * *
e * * *
w * * *
o * * *
r * * *
d * * *
though the asymptotic running time is the same.