I need to align three strings in Python:

str1 = "This is going to test the function"
str2 = "Th is is going to test function"
str3 = "This is gonna test the functon"

I need to align these strings so that if the word at position x in one string equals the word at position x in another string, it is placed at that index. I want to end up with three lists as output, matching the following format (a pandas table would also work):

0     1   2   3   4      5      6   7     8    9         10
This  -   is  -   going  -      to  test  the  function  -
-     Th  is  is  going  -      to  test  -    function  -
This  -   is  -   -      gonna  -   test  the  -         functon

I am extracting text from three OCR models and will use the ordered lists to vote for what word should be in each position.
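
To make the voting step concrete, here is a minimal sketch that takes the three aligned rows from the table above and picks the most common entry in each column. The names `aligned_rows`, `GAP`, and `vote` are hypothetical, and two choices are assumptions rather than requirements: a column is dropped when the gap marker wins the vote, and a three-way tie goes to the first row.

```python
from collections import Counter

GAP = "-"  # gap marker from the table; assumes "-" never occurs as a real word

# The three aligned rows, copied from the desired output above.
aligned_rows = [
    "This - is - going - to test the function -".split(),
    "- Th is is going - to test - function -".split(),
    "This - is - - gonna - test the - functon".split(),
]


def vote(rows):
    """Return the majority word of each column, skipping columns won by the gap."""
    winners = []
    for column in zip(*rows):
        # most_common(1) returns the entry with the highest count; on a tie it
        # returns the entry seen first, i.e. the one from the earliest row.
        word, _count = Counter(column).most_common(1)[0]
        if word != GAP:
            winners.append(word)
    return winners


print(" ".join(vote(aligned_rows)))  # prints: This is going to test the function
```

With only three rows, a word that appears in a single OCR output is always outvoted by the two gaps, which is why "Th", "gonna", and "functon" disappear here.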

Thank you

IbbyR
  • It doesn't seem like the output will be unique, i.e. there are multiple ways the words could be placed in columns. – Tom Karzes Jul 06 '23 at 21:08
  • @TomKarzes I guess the question is similar to [this answer](https://stackoverflow.com/a/73179873/11567381), but that uses a deprecated align method and only matches two strings. – IbbyR Jul 06 '23 at 21:13
  • What you need is a 3-way diff. You might look at the `merge3` module. https://pypi.org/project/merge3/ – Tim Roberts Jul 06 '23 at 21:52
  • @TimRoberts I tried it, but there doesn't seem to be much documentation, and I can't see how it helps in this case. – IbbyR Jul 06 '23 at 22:07
  • After `split`, you will end up with 3 lists of words. What you want is exactly the same as doing a `diff3` against three slightly different source code files. I believe `merge3` will do that task. – Tim Roberts Jul 07 '23 at 05:00
  • @TimRoberts I've looked at the documentation at https://www.breezy-vcs.org/doc/en/user-reference/merge-help.html?highlight=diff3 and it's not really clear to me. I tried using merge3 with the split strings, but I don't know how to use diff3: I get `AttributeError: module 'merge3' has no attribute 'diff3'`. Do you have any documentation on diff3/merge3 that clearly shows the modules, their attributes, and how to use them? – IbbyR Jul 07 '23 at 15:13
  • How large are your texts? This is not an easy problem. – Tim Roberts Jul 07 '23 at 20:37
  • @TimRoberts I know it's a very tricky problem. The size of the texts is not an issue, as I can slice the strings to any length, so let's use the 10-word example above. But if you really want to know, the OCR extracts one string per page, and the strings vary in length, ca. 400 words each. – IbbyR Jul 07 '23 at 20:46
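
Following the split-then-diff idea from the comments, here is a rough sketch that uses the standard library's `difflib.SequenceMatcher` instead of `merge3`. It is not a real `diff3`: it adds one string at a time to a growing set of columns, so the result depends on the order of the inputs and, as noted above, is only one of several possible alignments (it need not reproduce the table in the question). The names `GAP`, `add_sequence`, and `align` are hypothetical, and the code assumes "-" never occurs as a real word.

```python
from difflib import SequenceMatcher

GAP = "-"  # placeholder for "no word here"; assumes "-" is never a real word


def add_sequence(columns, words, depth):
    """Align `words` against `columns` (tuples of `depth` entries each)."""
    # Represent each existing column by its first real word so difflib can compare.
    keys = [next(w for w in col if w != GAP) for col in columns]
    matcher = SequenceMatcher(a=keys, b=words, autojunk=False)
    merged = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Matching words join the existing column.
            merged += [col + (w,) for col, w in zip(columns[i1:i2], words[j1:j2])]
        else:
            # Words that differ never share a column: old columns get a gap,
            # and each unmatched new word gets a fresh column of its own.
            merged += [col + (GAP,) for col in columns[i1:i2]]
            merged += [(GAP,) * depth + (w,) for w in words[j1:j2]]
    return merged


def align(*texts):
    """Return one gapped word list per input text, all of equal length."""
    columns = []
    for depth, text in enumerate(texts):
        columns = add_sequence(columns, text.split(), depth)
    if not columns:
        return [[] for _ in texts]
    return [list(row) for row in zip(*columns)]


str1 = "This is going to test the function"
str2 = "Th is is going to test function"
str3 = "This is gonna test the functon"

rows = align(str1, str2, str3)
for row in rows:
    print(" ".join(row))
```

Because only exact word matches share a column, near-misses such as "function" and "functon" end up in separate columns, which matches the format asked for. Grouping them would need a fuzzy comparison, which this sketch does not attempt.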
