Calculate word overlap from two words in two directions in python

Question

So I have two words. I want to write a function that finds the maximum overlap going from each word to the other. Example:

words = ['AAB', 'BAA']
find_overlap('AAB', 'BAA')

Should output B and size 1, and:

find_overlap('BAA', 'AAB')

Should output AA and size 2. Any suggestions on how to do it?

Edit: So I tried difflib.SequenceMatcher from python, but I don't understand the output.

s1 = "AAB"
s2 = "BAA"
s = difflib.SequenceMatcher(None, s1, s2)
pos_a, pos_b, size = s.find_longest_match(0, len(s1), 0, len(s2)) 
print(pos_a, pos_b, size)

Check the largest possible overlap first, and iterate to smaller and smaller overlaps. — Moberg, May 05 '17 at 15:21
@Moberg That suggests that to check for the largest overlap of the complete text of *War and Peace* and *Crime and Punishment* we would start by checking for overlaps of several hundred thousand characters then work our way down. Doesn't sound very efficient. The overlap is probably of length 0. — John Coleman, May 05 '17 at 15:33
@JohnColeman No, not very efficient. Although it means finding the first occurence of the first letter of CaP in WaP and working your way down. Of cource skipping to next occurence as soon as they don't match. What other ways are there? :) — Moberg, May 05 '17 at 15:39
Even though `difflib` looks like it should be relevant I'm not sure if it really is. `find_longest_match` doesn't seem helpful unless the longest match happens to occur at the end of one string and simultaneously at the beginning of the other. — John Coleman, May 05 '17 at 18:11
@John Coleman It's what I could find from other threads on stackoverflow. Any suggestions for another built-in or really simple function that can give me the overlap such that overlap(AAB, BAA) = B and overlap(BAA, AAB) = AA? — Lora Sinha, May 05 '17 at 18:20
@John Coleman I want a solution that works on whatever length. It should be simple, but not too naive. — Lora Sinha, May 05 '17 at 18:48

score 0 · Answer 1 · edited May 23 '17 at 10:31

0

For shorter strings a naïve approach is probably adequate. For example, the idea of @Moberg could be implemented like this

def largest_overlap(s1,s2):
    n = min(len(s1),len(s2))
    for i in range(n,0,-1):
        if s2.startswith(s1[-i:]):
            return s1[-i:]
    return ''

Some test cases:

print("BAA, AAB =>", largest_overlap("BAA", "AAB"))
print("AAB, BAA =>", largest_overlap("AAB", "BAA"))
print("AAA, BB =>", largest_overlap("AAA", "BB"))
print("AA, AABB =>", largest_overlap("AA", "AABB"))
print("hello world, world peace =>", largest_overlap("hello world", "world peace"))

Output:

BAA, AAB => AA
AAB, BAA => B
AAA, BB => 
AA, AABB => AA
hello world, world peace => world

For longer strings you might need a more sophisticated algorithm, something similar to this.

edited May 23 '17 at 10:31

Community

1
1

answered May 05 '17 at 17:28

John Coleman

51,337
7
54
119

Thanks. How can I modify it to output the whole word with no overlap, For example with BAA, AAB => AA , I also want BAAB. – Lora Sinha May 05 '17 at 19:29
@LoraSinha if `s` is the returned overlap, then `s1 + s2[len(s):]` is the two words concatenated with the overlap only appearing once. – John Coleman May 05 '17 at 19:52

Calculate word overlap from two words in two directions in python

1 Answers1