
I want to find the unique strings in a list of strings, using a similarity percentage (in Python). However, the strings should be significantly different to count as unique; if there is only a small difference between two strings, that's not interesting for me.

I could loop through the strings and compute their pairwise similarity percentages, but I was wondering if there is a better way to do it.

For example,

String A: He is going to school. 
String B: He is going to school tomorrow.

Let's say that these two strings are 80% similar.

Similarity: strings with the exact same words in the same order are the most similar. A string is 100% similar to itself.

It's a somewhat vague definition, but it works for my use case.
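
Roughly, the brute-force loop I had in mind looks like the sketch below; difflib.SequenceMatcher is just a placeholder for whatever similarity measure ends up being used, not a requirement:

from difflib import SequenceMatcher

def similarity(a, b):
    # Placeholder measure: ratio() returns a value between 0.0 and 1.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def filter_unique(strings, threshold=0.8):
    # Keep a string only if it is not too similar to anything already kept
    unique = []
    for s in strings:
        if all(similarity(s, u) < threshold for u in unique):
            unique.append(s)
    return unique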

ut_engr
  • How are you defining similarity? – Nick Chapman Jul 06 '17 at 09:31
  • @NickChapman Just added a comment for it. I mean exact same words in the same order. So a string is 100% similar to itself. – ut_engr Jul 06 '17 at 09:34
  • That's not a sufficient definition. You should provide a formula for the similarity percentage so that I could calculate, for example, the similarity between "She is going to school." and "Is she going to school?". This may be useful: https://en.wikipedia.org/wiki/Category:String_similarity_measures – perigon Jul 06 '17 at 09:37
  • @user55449 Okay, that's helpful, thanks. Regarding those two strings, they should be considered different because the word order isn't the same. I need a rough estimate. – ut_engr Jul 06 '17 at 09:39
  • Your formula should work for any two sentences - otherwise, we're just guessing what you mean. – perigon Jul 06 '17 at 09:39
  • See https://stackoverflow.com/questions/15173225/how-to-calculate-cosine-similarity-given-2-sentence-strings-python – alvas Jul 06 '17 at 09:55

1 Answer


If you want to check how similar two sentences are, and you care about the words appearing in the exact same order, then you can use a single-sentence BLEU score.

I would use sentence_bleu, found here: http://www.nltk.org/_modules/nltk/translate/bleu_score.html

You will need to make sure that you adjust the weights for short sentences, since BLEU uses up to 4-grams by default. An example from something I have done in the past:

from nltk.translate.bleu_score import sentence_bleu
from nltk import word_tokenize  # needs the 'punkt' tokenizer data: nltk.download('punkt')

sentence1 = "He is a dog"
sentence2 = "She is a dog"

reference = word_tokenize(sentence1.lower())
hypothesis = word_tokenize(sentence2.lower())

# BLEU scores up to 4-grams by default, so for sentences shorter than
# four tokens spread the weight evenly over the n-gram orders that exist.
if min(len(hypothesis), len(reference)) < 4:
    weighting = 1.0 / min(len(hypothesis), len(reference))
    weights = tuple([weighting] * min(len(hypothesis), len(reference)))
else:
    weights = (0.25, 0.25, 0.25, 0.25)

bleu_score = sentence_bleu([reference], hypothesis, weights=weights)
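
You could then threshold the score to decide whether the second sentence counts as "significantly different" from the first; the 0.5 cutoff below is just an illustration, not a recommended value:

if bleu_score < 0.5:
    print("significantly different")
else:
    print("similar enough to treat as redundant")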

Note that single-sentence BLEU is quite bad at detecting similar sentences with different word orderings, so if that is what you're interested in, be careful. Other methods you could try are document similarity, Jaccard similarity, and cosine similarity.
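
As a quick illustration of one of those alternatives, here is a minimal sketch of Jaccard similarity over word sets; note that, unlike BLEU, it ignores word order entirely:

from nltk import word_tokenize

def jaccard_similarity(sentence1, sentence2):
    # Compare the sets of words, ignoring order and repeated words
    words1 = set(word_tokenize(sentence1.lower()))
    words2 = set(word_tokenize(sentence2.lower()))
    return len(words1 & words2) / len(words1 | words2)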

Nick Chapman
  • Yes, BLEU is meant to be applied at the corpus level. You have to look out for [fearing the bleu side](https://gist.github.com/alvations/e5922afa8c91472d25c58b2d712a93e7). BTW, there are more "metrics" one can consider; google "Semantic Textual Similarity". – alvas Jul 06 '17 at 09:54
  • I'll try it on my use-case and see how it goes. I'll write the observations here. – ut_engr Jul 06 '17 at 09:55
  • @alvas I know that BLEU is meant to be used at a corpus level. That is why I specified that using BLEU will only be OK if he wants to know whether the word orderings are exactly the same. I am also aware that there are other similarity metrics; I was just throwing some out there that are pretty easy to get into and understand. – Nick Chapman Jul 06 '17 at 09:56
  • I really want the order to be the same. – ut_engr Jul 06 '17 at 09:58
  • Great! BTW, just like to add some links so that others reading the question has some references. Not particularly disagreeing with your answer =) Also, it's great to see that the first answer isn't "Levenshtein distance" ;P – alvas Jul 06 '17 at 09:58
  • @ut_engr, there are other metrics that take word order into account. RIBES is one =) – alvas Jul 06 '17 at 09:59
  • I will check that out as well and come back with which one works best for my case :) – ut_engr Jul 06 '17 at 10:47
  • @Nick This BLEU score worked for me. It's not really good but I am able to remove redundant strings to some extent. – ut_engr Jul 07 '17 at 08:48
  • @alvas I couldn't find any documentation on how to use RIBES. Can you point me to an example or something? – ut_engr Jul 07 '17 at 08:49
  • I hope this helps: https://github.com/nltk/nltk/blob/develop/nltk/translate/ribes_score.py#L82 – alvas Jul 07 '17 at 09:04