Rough string alignment in python

Question

If I have two strings of equal length like the following:

'aaaaabbbbbccccc'
'bbbebcccccddddd'

Is there an efficient way to align the two such that the most letters as possible line up as shown below?

'aaaaabbbbbccccc-----'
'-----bbbebcccccddddd'

The only way I can think of doing this is brute force by editing the strings and then iterating through and comparing.

may I ask what you application is? Do you have an example use that is less abstract? — Tadhg McDonald-Jensen, Mar 06 '17 at 20:54
Are all your use cases structured exactly like this? Could you provide examples as to what outputs you'd expect from the function? e.g more on letter combinations and letter order. — Fruitspunchsamurai, Mar 06 '17 at 21:15
@The Nightman This might help: [Approximate string matching](https://en.wikipedia.org/wiki/Approximate_string_matching#Problem_formulation_and_algorithms) — tkhurana96, Mar 06 '17 at 21:30
You need a Diff algorithm. See http://stackoverflow.com/questions/805626/diff-algorithm — Mark Ransom, Mar 06 '17 at 21:39

TemporalWolf · Accepted Answer · 2017-03-06T21:38:46.523

Return the index which gives the maximum score, where the maximum score is the strings which have the most matching characters.

def best_overlap(a, b):
    return max([(score(a[offset:], b), offset) for offset in xrange(len(a))], key=lambda x: x[0])[1]

def score(a, b):
    return sum([a[i] == b[i] for i in xrange(len(a))])

>>> best_overlap(a, b)
5
>>> a + '-' * best_overlap(a, b); '-' * best_overlap(a, b) + b
'aaaaabbbbbccccc-----'
'-----bbbebcccccddddd'

Or, equivalently:

def best_match(a, b):
    max = 0
    max_score = 0
    for offset in xrange(len(a)):
        val = score(a[offset:], b)
        if val > max_score:
            max_score = val
            max = offset
    return max

There is room for optimizations such as:

Early exit for no matching characters
Early exit when maximum possible match found

score 2 · Answer 2 · answered Mar 06 '17 at 21:07

2

I'm not sure what you mean by efficient, but you can use the find method on str:

first = 'aaaaabbbbbccccc'
second = 'bbbebcccccddddd'
second_prime = '-'* first.find(second[0]) + second
first_prime = first + '-' * (len(second_prime) - len(first))
print first_prime + '\n' + second_prime
# Output:
# aaaaabbbbbccccc-----
# -----bbbebcccccddddd

answered Mar 06 '17 at 21:07

aquil.abdullah

3,059
3
21
40

this would make the first letter of the second string line up with the first occurrence in the first string, the question says "_most letters as possible line up_" – Tadhg McDonald-Jensen Mar 06 '17 at 21:13
1

Well in that case more examples of the expected output are needed as well as the criteria for determining how the strings should overlap. The example given `aaaaabbbbbccccc` already aligns`bbbebcccccddddd`. There is no need to pad either side with the '-'. – aquil.abdullah Mar 06 '17 at 21:23

score 1 · Answer 3 · answered Mar 06 '17 at 21:31

I can't see any other way than brute forcing it. The complexity will be quadratic in the string length, which might be acceptable, depending on what string lengths you are working with.

Something like this maybe:

def align(a, b):
    best, best_x = 0, 0
    for x in range(len(a)):
        s = sum(i==j for (i,j) in zip(a[x:],b[:-x]))
        if s > best:
            best, best_x = s, x
    return best_x

align('aaaaabbbbbccccc', 'bbbebcccccddddd')
5

score 0 · Answer 4 · answered Mar 06 '17 at 21:30

I would do something like the binary & function on each of your strings. Compares each of the strings when they are lined up, counting up the number of times letters match. Then, shift by one and do the same thing, and go on and on with shifting until they are no longer lined up. The shift with the most matching letters in this fashion is the correct output shift, and you can add the dashes when you print it out. You don't actually have to modify the strings for this, just count the number of shifts and offset your comparing of the characters by that shift amount. This is not terribly efficient (O(n^2) = n+(n-2)+(n-4)...), but is the best I could come up with.

Rough string alignment in python

4 Answers4