I'm having two strings
string1 = "apple banna kiwi mango"
string2 = "aple banana mango lemon"
I want the resultant of addition of these two strings (not concatenation) i.e. result should look like
result = "apple banana kiwi mango lemon"
My current approach is rather simple.
- Tokenize the multiline string (the above strings are after tokenization), remove any noises (special/ newline characters/ empty strings)
- The next step is to identify the cosine similarity of the strings, if it is above 0.9, then I add one of the string to final result
Now, here is the problem. It doesn't cover the part where one string contains one half of a word and other contains the other half (or correct word in some cases) of word. I have also added this function in my script. But again the problem remains. Any help on how to move forward with this is appreciated.
def text_to_vector(text):
words = WORD.findall(text)
return Counter(words)
def get_cosine(vec1, vec2):
intersection = set(vec1.keys()) & set(vec2.keys())
numerator = sum([vec1[x] * vec2[x] for x in intersection])
sum1 = sum([vec1[x]**2 for x in vec1.keys()])
sum2 = sum([vec2[x]**2 for x in vec2.keys()])
denominator = math.sqrt(sum1) * math.sqrt(sum2)
if not denominator:
return 0.0
else:
return float(numerator) / denominator
def merge_string(string1, string2):
i = 0
while not string2.startswith(string1[i:]):
i += 1
sFinal = string1[:i] + string2
return sFinal
for item in c:
for j in d:
vec1 = text_to_vector(item)
vec2 = text_to_vector(j)
r = get_cosine(vec1, vec2)
if r > 0.5:
if r > 0.85:
final.append(item)
break
else:
sFinal = merge_string(item, j)
#print("1.", len(sFinal), len(item), len(j))
if len(sFinal) >= len(item) + len(j) -8:
sFinal = merge_string(j, item)
final.append(sFinal)
#print("2.", len(sFinal), len(item), len(j))
temp.append([item, j])
break