
What I'm looking for is not just a plain similarity score between two texts, but a similarity score for a substring inside a string. Say:

text1 = 'cat is sleeping on the mat'.

text2 = 'The cat is sleeping on the red mat in the living room'.

In the above example, all the words of text1 are present in text2 completely, hence the similarity should be 100%.

If some words of text1 are missing, the score should be lower.

I'm working with a large dataset of paragraphs of varying size, so finding a smaller paragraph inside a bigger one with such a similarity score is crucial.

I have only found string similarity measures such as cosine similarity, difflib similarity, etc., which compare two whole strings; none of them give a score for a substring inside another string.

DarkCygnus
Sanjay Kamath
    This question is about [tag:algorithms] rather than [tag:python]. – Peter Wood Jan 05 '18 at 16:30
  • This might be a duplicate of https://stackoverflow.com/questions/17388213/find-the-similarity-percent-between-two-strings But I would check out that library for comparing two strings. Then you will need to find a way to find the substring. It might work with substrings. Haven't tested to try that out though. – Zack Tarr Jan 05 '18 at 16:49
  • 2
    Have you tried [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance)? It may be useful here, included it in my answer as well. – DarkCygnus Jan 05 '18 at 18:33
  • See also: https://stackoverflow.com/questions/15173225/calculate-cosine-similarity-given-2-sentence-strings – alvas Jan 06 '18 at 16:54
  • Dupe of https://stackoverflow.com/questions/17388213/find-the-similarity-percent-between-two-strings too? – alvas Jan 06 '18 at 16:55

4 Answers


Based on your description, how about:

>>> a = "cat is sleeping on the mat"
>>> b = "the cat is sleeping on the red mat in the living room"
>>> a = a.split(" ")
>>> score = 0.0
>>> for word in a:  # for every word in the smaller string
...     if word in b:  # if it is in the bigger string, increase the score
...         score += 1
...
>>> score / len(a)  # obtain percentage given total word number
1.0

In case a word is missing, for example:

>>> c = "the cat is not sleeping on the mat"
>>> c = c.split(" ")
>>> score = 0.0
>>> for w in c:
...     if w in b:
...         score += 1
...
>>> score / len(c)
0.875

Additionally, you can do as @RoadRunner suggests and split b, saving it as a set to speed up performance with b = set(b.split(" ")). This reduces each membership check to O(1) and makes the overall algorithm O(n).
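That optimization might look like this (a sketch of the same loop with the bigger string pre-split into a set):

```python
# same word-counting idea, with the bigger string pre-split into a set
b = "the cat is sleeping on the red mat in the living room"
b_words = set(b.split())  # set membership is O(1) per lookup

a_words = "cat is sleeping on the mat".split()
score = sum(1 for word in a_words if word in b_words)
print(score / len(a_words))  # 1.0
```

Note that the set version also only matches whole words: with the raw string, `"at" in b` is True because of "mat", while `"at" in b_words` is False.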

Edit: You say you have already tried some metrics like cosine similarity, etc. However, I suspect you may benefit from checking the Levenshtein distance, which could be of some use in this case in addition to the solutions provided.
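One way to combine the two ideas (a sketch only; the function name `fuzzy_containment` is made up for illustration, and `difflib.SequenceMatcher`'s ratio is based on matching blocks rather than exactly the Levenshtein distance) is to credit each word of the smaller text by its best fuzzy match among the words of the larger text:

```python
from difflib import SequenceMatcher

def fuzzy_containment(small, big):
    """Average best-match ratio of each word of `small` against the words of `big`."""
    big_words = big.lower().split()
    total = 0.0
    for word in small.lower().split():
        # credit the closest word in the bigger text, not only exact matches
        total += max(SequenceMatcher(None, word, w).ratio() for w in big_words)
    return total / len(small.split())

print(fuzzy_containment("cat is sleeping on the mat",
                        "The cat is sleeping on the red mat in the living room"))  # 1.0
```

A typo such as "sleeeping" would then still score close to, but below, 1.0 instead of being counted as a complete miss.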

DarkCygnus
  • 2
    Nice answer, definitely the simplest approach. I think you need to split `b` here though? – RoadRunner Jan 05 '18 at 18:16
  • 1
    @RoadRunner thanks. This works without having to split `b` (pasted directly from my console). However, you *can* split it if you want the performance enhancement, which would be better I guess. – DarkCygnus Jan 05 '18 at 18:19
  • 2
    Yeah you're right. This gives the OP a good starting point, he/she can choose what to use and what not to use, depending on the text he/she is working with. – RoadRunner Jan 05 '18 at 18:25
  • Thank you very much DarkCygnus and @RoadRunner. This and the above one are a perfect fit for what I was looking for. I tried Levenshtein distance in my first attempt, but since my Text2 has unwanted text around it, I would get a lower similarity score when compared with Text1. Hence I was looking for an alternative. – Sanjay Kamath Jan 07 '18 at 21:27
  • 1
    Glad I could help @SanjayKamath :) consider accepting an answer if it solved your problem once you've tried. Cheers – DarkCygnus Jan 08 '18 at 00:58

You could also use a collections.defaultdict to store the counts of words in word_a that exist in word_b, then divide the sum() of the counts by the length of word_a at the end:

from collections import defaultdict

a = "the cat is not sleeping on the mat"
b = "the cat is sleeping on the red mat in the living room"

word_a = a.split()
word_b = set(b.split())

d = defaultdict(int)
for word in word_a:
    if word in word_b:
        d[word] += 1

print(sum(d.values()) / len(word_a))

Which outputs:

0.875

Note: Since we only care about checking if words in word_a exist in word_b, then converting word_b to a set() will allow O(1) lookup, instead of keeping it a list, which will be O(n). This then makes the overall time complexity of the above code O(n).
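The same tally can also be written with `collections.Counter` (a sketch; equivalent to the defaultdict version, but it also keeps per-word counts you could inspect afterwards):

```python
from collections import Counter

a = "the cat is not sleeping on the mat"
b = "the cat is sleeping on the red mat in the living room"

word_a = a.split()
word_b = set(b.split())

# tally only the words of `a` that also occur in `b`
counts = Counter(w for w in word_a if w in word_b)
print(sum(counts.values()) / len(word_a))  # 0.875
```

Here `counts["the"]` would be 2, since "the" appears twice in `a` and exists in `b`.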

RoadRunner
  • 1
    Interesting addition. So you say `word in word_b` is going to be O(1) if it is a set then? Still, you are doing `for word in word_a`, so that makes your algorithm O(n) right? – DarkCygnus Jan 05 '18 at 18:00
  • 2
    @DarkCygnus Yeah I agree, this solution is `O(n)`. I only added the set lookup optimization to slightly improve performance. I might have to run some tests on some bigger samples to see if it really makes much of a difference. – RoadRunner Jan 05 '18 at 18:04
  • 1
    Indeed. OP can also tailor these options to fit his needs depending on the data he is handling. Cheers – DarkCygnus Jan 05 '18 at 18:06

Similar to DarkCygnus's answer, but the similarity is based on total character count instead of word count. On the other hand, this script only checks coincidence with complete words (text_2.split()):

from __future__ import division  # Python 2 compatibility for true division

text_1 = 'cat is sleeping on the mat'
text_2 = 'The cat is sleeping on the red mat in the living room'

words_2 = text_2.split()  # split once, outside the loop
no_match = 0
match = 0

for word in text_1.split():
    if word not in words_2:
        no_match += len(word)
    else:
        match += len(word)

similarity = match / (match + no_match)
print('{0:.0%}'.format(similarity))
  • 1
    Although this seems like an alternative approach, OP seems to clearly ask about some way of considering similarities between *words*, as he mentions it several times in the Question, i.e.: "But not about a score of substring inside another string.", "In the above example, the words...is present in the Text2 completely, hence the similarity should be 100%." – DarkCygnus Jan 05 '18 at 17:26
  • 1
    Besides, I suspect that your approach may give some unexpected behavior. If you compare *by characters* instead of by words, you may get 100% match in things like "**asdf**" and "**as** you **d**id **f**irst", as it contains all characters, even though they are completely different things. – DarkCygnus Jan 05 '18 at 17:28
  • I don't think so. If you check in your script an `a` like a = "cat is sleeping on the mat lee ivi", its similarity will be 100%; in my script, 81%. – Carlos Fernández Jan 05 '18 at 17:53
  • Words like 'ivi' and 'lee' don't make sense and probably are not present on typed text analysis. Still, my last comment is true, but I agree that my approach may have different results if unrealistic words are present. Should be up to OP to decide what fits his needs better. – DarkCygnus Jan 05 '18 at 18:09
  • The same happens with the word 'ping', and that isn't an unrealistic word ... I think you only need to change word_b to word_b.split() (in the if) and it will be right. – Carlos Fernández Jan 05 '18 at 18:15
  • You said "you may get 100% match in things like "asdf" and "as you did first", as it contains all characters, even though they are completely different things" .... and that is wrong. If you look at my code, I check complete words, but the similarity value uses the size of the word. – Carlos Fernández Jan 05 '18 at 18:19
  • I said I "suspect" that may happen. Your approach is also valid, like I said, but it's up to OP to see what fits his needs better. I will leave the subject here, as comments are not for extended discussion. – DarkCygnus Jan 05 '18 at 18:21
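The whole-word versus substring distinction debated above is easy to demonstrate:

```python
b = "the cat is sleeping on the red mat in the living room"

print("ping" in b)          # True:  substring of "sleeping"
print("ping" in b.split())  # False: not a whole word
```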

I think this can be achieved with the Levenshtein distance combined with substring matching. What you can do is split a sentence into smaller words (using spaces as a separator), then run the Levenshtein matching algorithm to match individual words with your search string. Something like:

def similar_word(string, substring, threshold=2):

    def levenshtein_distance(s1, s2):
        # classic dynamic-programming edit distance
        m, n = len(s1), len(s2)
        dp = [[0] * (n + 1) for _ in range(m + 1)]

        for i in range(m + 1):
            for j in range(n + 1):
                if i == 0:
                    dp[i][j] = j
                elif j == 0:
                    dp[i][j] = i
                elif s1[i - 1] == s2[j - 1]:
                    dp[i][j] = dp[i - 1][j - 1]
                else:
                    dp[i][j] = 1 + min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1])
        return dp[m][n]

    # slide a window of the substring's length over the string
    for i in range(len(string) - len(substring) + 1):
        distance = levenshtein_distance(string[i:i + len(substring)], substring)
        if distance <= threshold:
            return True

    return False

https://gist.github.com/4f77616973/66a784c4c5921359299d603419a8f01b

Since you want the score, you can tweak the above code to return the distance instead of True/False.
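For instance, the loop could track the minimum distance over all windows and turn it into a percentage (a sketch; normalizing by the substring's length is my own choice, not something from the original answer):

```python
def levenshtein_distance(s1, s2):
    # classic dynamic-programming edit distance
    m, n = len(s1), len(s2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0:
                dp[i][j] = j
            elif j == 0:
                dp[i][j] = i
            elif s1[i - 1] == s2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                dp[i][j] = 1 + min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1])
    return dp[m][n]

def similarity_score(string, substring):
    # best (lowest) distance over every window of the substring's length
    best = min(
        levenshtein_distance(string[i:i + len(substring)], substring)
        for i in range(len(string) - len(substring) + 1)
    )
    return 1 - best / len(substring)  # 1.0 means an exact window match

print(similarity_score("the cat sat", "cat"))  # 1.0
```

With a near miss such as `similarity_score("the cot sat", "cat")`, the best window is "cot" at distance 1, giving roughly 0.67.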

Hope that helps! :)

0x4f