I have a large number of pairs of strings, for example:

s1 = 'newyork city lights are yellow'
s2 = ' the city of new york is large'

I would like to write a function that gets s1 and s2 (regardless of the order) and outputs:

s1_output = 'new york city lights are yellow'
s2_output = 'the city of new york is large'

such that the newyork in s1 is separated into new york, or at least a way to find the token in one string that matches tokens in the other string after a single character (whitespace) insertion.

The matched tokens are not known in advance, and a match is not guaranteed to be present in the text. Any ideas?

Latent
  • maybe something like `s.replace('newyork', 'new york').strip()`? – rv.kvetch Sep 19 '21 at 15:56
  • It's just an example; you don't know the elements in advance – Latent Sep 19 '21 at 16:12
  • why do we want to replace `newyork` with `new york` in this case? I guess that part wasn't really clear to me – rv.kvetch Sep 19 '21 at 16:29
  • 1
    given i have two strings where there is a clear fuzzy match between one of the elements in them (i.e Base ball and "baseball") I want to find a way to extract that element and normalize both texts to the same format . – Latent Sep 19 '21 at 16:40
  • 1
    does this answer your question? https://stackoverflow.com/a/50534532/10237506 – rv.kvetch Sep 19 '21 at 17:12
  • No, because in that solution you need to know what to look for (e.g. lion) in order to search for it in the other string. I am asking for a generic approach that, for any s1 and s2, checks whether there is an element that can be matched just by inserting whitespace, similar to Levenshtein distance but at the token level – Latent Sep 19 '21 at 17:30
  • What have you tried - edit the code of your attempt to solve this into your question as a [mre] and explain what’s wrong. – DisappointedByUnaccountableMod Sep 19 '21 at 20:40

1 Answer

Something like this can work:

s1 = 'newyork city lights are yellow'
s2 = ' the city of new york is large'

# split() with no arguments splits on any run of whitespace and ignores
# leading/trailing whitespace, so a separate strip() is not needed
words_s1 = s1.split()
words_s2 = s2.split()

# For each word in list 1, compare it to each pair of adjacent (concatenated) words in list 2.
# enumerate() gives the true position of the word; list.index() would return the
# first occurrence only, which is wrong if the word appears more than once.
for idx, word in enumerate(words_s1):
    for i in range(len(words_s2) - 1):
        if word == words_s2[i] + words_s2[i + 1]:
            print(f"Word #{idx} in s1 matches words #{i} and #{i+1} in s2")

This matches words in the way you described: loop through the words of s1 and compare each one against every pair of adjacent words in s2 joined together.

You could then also loop the opposite way (go through the words of s2 and check each against joined adjacent words in s1) so that the order of the two strings does not matter.

To produce the output strings, keep track of where the matches occur and rebuild each string with the matched word replaced by its two-word form.
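Putting those pieces together, here is a minimal sketch of one way to do it. The function names `split_joined_words` and `normalize_pair` are made up for illustration, and the sketch assumes a match always consists of exactly two adjacent words joined without a space:

```python
def split_joined_words(source, reference):
    """Split any word in `source` that equals two adjacent words of
    `reference` joined together (hypothetical helper, two-word matches only)."""
    ref_words = reference.split()
    # Map each concatenation of adjacent reference words to its spaced form,
    # e.g. 'newyork' -> 'new york'
    joined = {a + b: f"{a} {b}" for a, b in zip(ref_words, ref_words[1:])}
    # Rebuild the string, replacing matched words and keeping the rest as-is
    return " ".join(joined.get(word, word) for word in source.split())


def normalize_pair(s1, s2):
    """Apply the split in both directions so the argument order does not matter."""
    return split_joined_words(s1, s2), split_joined_words(s2, s1)


s1 = 'newyork city lights are yellow'
s2 = ' the city of new york is large'
s1_output, s2_output = normalize_pair(s1, s2)
print(s1_output)  # new york city lights are yellow
print(s2_output)  # the city of new york is large
```

Building the dictionary once per string avoids the nested loop, which may matter with a large number of pairs. Note that this can also split a legitimate word that happens to equal two adjacent words in the other string (for example "apart" versus "a part"), so you may want an extra check depending on your data.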

hawruh