1

I have 2 similar strings. How can I find the most likely word alignment between these two strings in Python?

Example of input:

string1 = 'my channel is youtube dot com slash example and then I also do live streaming on twitch.'
string2 = 'my channel is youtube.com/example and then I also do livestreaming on twitch.'

Desired output:

alignment['my']        = 'my'
alignment['channel']   = 'channel'
alignment['is']        = 'is'
alignment['youtube']   = 'youtube.com/example'
alignment['dot']       = 'youtube.com/example'
alignment['com']       = 'youtube.com/example'
alignment['slash']     = 'youtube.com/example'
alignment['example']   = 'youtube.com/example'
alignment['and']       = 'and'
alignment['then']      = 'then'
alignment['I']         = 'I'
alignment['also']      = 'also'
alignment['do']        = 'do'
alignment['live']      = 'livestreaming'
alignment['streaming'] = 'livestreaming'
alignment['on']        = 'on'
alignment['twitch']    = 'twitch'
Franck Dernoncourt
  • 77,520
  • 72
  • 342
  • 501
  • Why is that the "most likely" alignment? It aligns `my channel is youtube` at the front, but there's another option that would align `and then I also do livestreaming on twitch` at the end, which is a longer match (with the exception of the space in `live streaming`) – John Gordon Jul 30 '22 at 23:42
  • @JohnGordon thanks, doesn't the example also align `and then I also do livestreaming on twitch` at the end? – Franck Dernoncourt Jul 30 '22 at 23:47
  • I recommend removing the search tag `nlp` (non-linear programming) from this post. `nlp` is about maximizing or minimzing a function subject to certain constraints. For example minimizing the cost of installing tile in someone bathroom floor subject to the constraint that a whole number of boxes of tiles must be purchased and that the square footage of tile purchase must be at least 341.81 square feet. – Toothpick Anemone Jul 30 '22 at 23:48
  • 2
    @SamuelMuldoon the tag NLP is used for natural language processing – Franck Dernoncourt Jul 30 '22 at 23:49

3 Answers3

3

Alignment is tricky. spaCy can do it (see Aligning tokenization) but AFAIK it assumes that the two underlying strings are identical which is not the case here.

I used Bio.pairwise2 a few years back for a similar problem. I don't quite remember the exact settings, but here's what the default setup would give you:

from Bio import pairwise2
from Bio.pairwise2 import format_alignment


string1 = 'my channel is youtube dot com slash example and then I also do live streaming on twitch.'
string2 = 'my channel is youtube.com/example and then I also do livestreaming on twitch.'

alignments = pairwise2.align.globalxx(string1.split(), 
                                      string2.split(),
                                      gap_char=['-']
                                     )

The resulting alignments - pretty close already:

>>> format_alignment(*alignments[0])
my channel is youtube dot com slash example          -          and then I also do live streaming       -       on twitch. 
 |    |     |                                                    |    |  |   |   |                               |    |    
my channel is    -     -   -    -      -    youtube.com/example and then I also do  -       -     livestreaming on twitch. 
  Score=10

You can provide your own matching functions, which would make fuzzywuzzy an interesting addition.

fsimonjetz
  • 5,644
  • 3
  • 5
  • 21
3

Previous answers offer biology-based alignment methods, there are NLP-based alignments methods as well. The most standard would be the Levenshtein edit distance. There are a few variants, and generally this problem is considered closely related to the question of text similarity measures (aka fuzzy matching, etc.). In particular it's possible to mix alignment at the level of word and characters. as well as different measures (e.g. SoftTFIDF, see this answer).

Erwan
  • 1,385
  • 1
  • 12
  • 22
1

The Needleman-Wunch Algorithm

Biologists sometimes try to align the DNA of two different plants or animals to see how much of their genome they share in common.

MOUSE: A  A  T  C  C  G  C  T  A  G  
RAT:   A  A  A  C  C  C  T  T  A  G  
       +  +  -  +  +  -  -  +  +  + 

Above "+" means that pieces of DNA match.
Above "-" means that pieces of DNA mis-match.

You can use the full ASCII character set (128 characters) instead of the letters ATCG that biologists use.

I recommend using the the Needleman Wunsch Algorithm

Needle-Wunsch is not the fastest algorithm in the world.
However, Needle-Wunsch is easy to understand.

In cases were one string of English text is completely missing a word present in the other text, Needleman Wunsch will match the word to special "GAP" character.

+-------+--------+-------+---+------+----+-----+-------+-----+----+-----+-------+------+
|  The  | reason | that  | I | went | to | the | store | was | to | buy |  some | food |
+-------+--------+-------+---+------+----+-----+-------+-----+----+-----+-------+------+
| <GAP> | reason | <GAP> | I | went |  2 | te  | store | wuz |  2 | buy | <GAP> | fud  |
+-------+--------+-------+---+------+----+-----+-------+-----+----+-----+-------+------+

The special GAP characters are fine.

However, what is in-efficient about Needle Wunsch is that people who wrote the algorithm believed that the order of the gap characters was important. The following are computed as two separate cases:

ALIGNMENT ONE

+---+-------+-------+---+---+
| A |   1   | <GAP> | R | A |
+---+-------+-------+---+---+
| A | <GAP> | B     | R | A |
+---+-------+-------+---+---+

ALIGNMENT TWO

+---+-------+-------+---+---+
| A | <GAP> | 1     | R | A |
+---+-------+-------+---+---+
| A | B     | <GAP> | R | A |
+---+-------+-------+---+---+  

However, if you have two or more gaps in a row, then order of the gaps should not matter.

The Needleman-Wunch algorithm calculates the same thing many times over because whoever wrote the algorithm thought that order mattered a little more than it really does.

The following two alignments have the same score.

Also, both alignments have more or less the same meaning in the "real world" (outside of the computer).

However, the Needleman-Wunch algorithm will compute the scores of the two example alignments twice instead of computing it only one time.

Toothpick Anemone
  • 4,290
  • 2
  • 20
  • 42