Merge Similar Strings Python

Question

I have two strings:

string1 = "apple banna kiwi mango"
string2 = "aple banana mango lemon"

I want to merge these two strings (not concatenate them), i.e. the result should look like

result = "apple banana kiwi mango lemon"

My current approach is rather simple.

Tokenize the multiline string (the strings above are what I get after tokenization) and remove any noise (special characters, newlines, empty strings).
Then compute the cosine similarity of the strings; if it is above 0.9, I add one of the strings to the final result.

Now, here is the problem: this doesn't cover the case where one string contains one half of a word and the other contains the other half (or, in some cases, the correct word). I have also added a merge_string function to my script for this, but the problem remains. Any help on how to move forward with this is appreciated.

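import math
import re
from collections import Counter

# Assumption: WORD is not shown in the question; a word-matching regex is the usual choice
WORD = re.compile(r"\w+")

def text_to_vector(text):
    # Bag-of-words vector: word -> number of occurrences
    words = WORD.findall(text)
    return Counter(words)

def get_cosine(vec1, vec2):
    # Dot product over the words both vectors share
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum(vec1[x] * vec2[x] for x in intersection)

    sum1 = sum(vec1[x]**2 for x in vec1.keys())
    sum2 = sum(vec2[x]**2 for x in vec2.keys())
    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    if not denominator:
        return 0.0
    return float(numerator) / denominator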


def merge_string(string1, string2):
    i = 0
    while not string2.startswith(string1[i:]):
        i += 1

    sFinal = string1[:i] + string2
    return sFinal

# c and d hold the two tokenized inputs; final and temp are result lists (none of them shown in the question)
for item in c:
    for j in d:
        vec1 = text_to_vector(item)
        vec2 = text_to_vector(j)
        r = get_cosine(vec1, vec2)
        if r > 0.5:
            if r > 0.85:
                final.append(item)
                break
            else:
                sFinal = merge_string(item, j)
                #print("1.", len(sFinal), len(item), len(j))
                if len(sFinal) >= len(item) + len(j) - 8:
                    sFinal = merge_string(j, item)
                    final.append(sFinal)
                    #print("2.", len(sFinal), len(item), len(j))
                    temp.append([item, j])
                    break

@JackDaniels yes, you need to show us what you've tried already. — ruohola, Sep 17 '18 at 13:14
I don't think there is a proper way to do this without having either a dictionary of correctly spelled words, or more than two lists, so you could apply some kind of "majority vote" for which variant to pick. — tobias_k, Sep 17 '18 at 13:28

LetzerWille · Accepted Answer · 2018-09-18T14:12:47.650

4

The difficult part is to check if the word is a valid English word.

For this, you either need a dictionary to check the word against, or you can use nltk.

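     pip install nltk

     import nltk
     nltk.download('wordnet')  # one-time download of the WordNet corpus
     from nltk.corpus import wordnet

     # join with a space so the last word of string1 is not fused with the first word of string2
     set([w for w in (string1 + " " + string2).split() if wordnet.synsets(w)])

     Out[41]: {'apple', 'banana', 'kiwi', 'lemon', 'mango'}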

To catch digits, if present, add isdigit().

st1 = 'Includes Og Added Sugars'

st2 = 'Includes 09 Added Sugars 09'


set([w for w in (st1 + " " + st2).split() if (wordnet.synsets(w) or w.isdigit())])

Out[30]: {'09', 'Added', 'Includes', 'Sugars'}

To catch abbreviations like g or mg, add re.match() (this needs import re).

set([w for w in (st1 + " " + st2).split() if (wordnet.synsets(w) or w.isdigit() or re.match(r'\d+g|mg', w))])

Out[40]: {'09', '0g', 'Added', 'Includes', 'Sugars'}
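
The asker mentions in the comments that the result should keep its order, which a set does not guarantee. A minimal sketch of an order-preserving variant follows. Note that `merge_in_order`, `is_valid` and `VOCAB` are hypothetical names introduced here for illustration, and the tiny vocabulary only stands in for the WordNet lookup so the example runs without any download:

```python
def merge_in_order(first, second, is_valid):
    """Merge two whitespace-separated strings, keeping first-seen order.

    is_valid is a caller-supplied predicate (hypothetical); with nltk it
    could be: lambda w: bool(wordnet.synsets(w)) or w.isdigit()
    """
    tokens = (first + " " + second).split()
    # dict.fromkeys drops duplicates but, unlike set, keeps insertion order
    kept = dict.fromkeys(w for w in tokens if is_valid(w))
    return " ".join(kept)


# Stand-in vocabulary (assumption) instead of the WordNet check
VOCAB = {"apple", "banana", "kiwi", "mango", "lemon"}

result = merge_in_order(
    "apple banna kiwi mango",
    "aple banana mango lemon",
    lambda w: w in VOCAB,
)
print(result)  # apple kiwi mango banana lemon
```

Here "banana" lands after "mango" because the misspelled "banna" is dropped rather than corrected in place. Getting exactly the order shown in the question would additionally need some fuzzy matching of near-duplicate tokens.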

edited Sep 18 '18 at 14:12

answered Sep 17 '18 at 13:36


This seems to be (close to) the right approach, but the result is odd. You never "fix" `w`, so how are there two "banana" in the result, and where is the second correctly spelled "mango"? – tobias_k Sep 17 '18 at 13:44
@tobias_k. Thank you for your comment. I missed repeating words.. Changing to set, does it... – LetzerWille Sep 17 '18 at 13:47
I'll try this. I have already tried symspell but that didn't help much. But will give it a shot, and will update here. – Jack Daniels Sep 17 '18 at 19:21
@LetzerWille It doesn't seem to work. Like in this case ['Includes Og Added Sugars/', 'Includes 09 Added Sugars 09'], the combination doesn't seem to work. Also I need the resultant in order as it is. I was hoping for more of a dynamic solution rather than a static. I have tried [symspell](https://github.com/mammothb/symspellpy) for correction, but didn't work as well. Anyways, not looking for a correction as such, just need the final output as in the example in question above. – Jack Daniels Sep 18 '18 at 05:30
@JackDaniels, it worked for me. st1 = 'Includes Og Added Sugars' st2 = 'Includes 09 Added Sugars 09' set([w for w in (st1 + st2).split() if wordnet.synsets(w)]) Out[22]: {'Added', 'Includes', 'Sugars'} – LetzerWille Sep 18 '18 at 13:10
@LetzerWille it covered only the text, say I have some numbers as well in the string. I need a resultant of both strings (and not just for text). – Jack Daniels Sep 18 '18 at 13:27
@JackDaniels, just add isdigit() set([w for w in (st1 + st2).split() if (wordnet.synsets(w) or w.isdigit())]) Out[30]: {'09', 'Added', 'Includes', 'Sugars'} – LetzerWille Sep 18 '18 at 13:36
This seems the closest I have come. Thanks. – Jack Daniels Sep 19 '18 at 07:16

score 1 · Answer 2 · answered Sep 17 '18 at 13:19


Have you ever heard of Levenshtein's distance? I suggest the following algorithm:

Split each string into a list of words (string1.split(" ")).
Loop over the words of string1. Inside that loop, loop over the words of string2, and if the Levenshtein distance between the two words is, say, less than 3, push the word to the result array.
Return result.

res = []
for i in string1.split(" "):
    for k in string2.split(" "):
        if levenshtein(i, k) < 3:
            res.append(i)
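
The answer does not say where `levenshtein` comes from. As a rough, self-contained sketch, the distance can be computed with the standard dynamic-programming recurrence. `similar_words` and `max_distance` are hypothetical names, and `max_distance=2` corresponds to the "less than 3" threshold above:

```python
def levenshtein(a, b):
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def similar_words(string1, string2, max_distance=2):
    """Words of string1 that have a near match among the words of string2."""
    words2 = string2.split()
    res = []
    for w1 in string1.split():
        if any(levenshtein(w1, w2) <= max_distance for w2 in words2):
            res.append(w1)
    return res


print(similar_words("apple banna kiwi mango", "aple banana mango lemon"))
# ['apple', 'banna', 'mango']
```

As the output shows, this only returns words that have a near match and always keeps the spelling from string1 ("banna"), so unmatched words such as "kiwi" and "lemon", and the choice between two variants, still have to be handled separately.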

answered Sep 17 '18 at 13:19

Kurns


I tried this as well as soundex. But thing is which word should I push, pushing the bigger word of two is not always the answer here. – Jack Daniels Sep 17 '18 at 13:23
