how to find similarity between many strings and plot it

Question

I have a xls file with one column and 10000 strings I want to do few things

1- make a heatmap or a cluster figure shows the similarity percentage between each string with another one.

In order to find the percentage of similaity between one with another, I found this post Find the similarity percent between two strings and I tried to make it work for me

As an example, I have these in a xls file where each line is one string

AAAAAAAAAAAAAADPVAGDHENIVSDSTQASR
AAAAAGPLQPETENAGTSV
AAAAANNGAAPPDLSLMALAR
AAAAASAVNDYYGTWGQK
AAAAASGASNTDSSATKPK
AAAAGFNWDDADVK
AAAAGFNWDDADVKK

I could not figure out how to use that example, for when I have many combinations for example in my example , I have 7 strings and each one has a similarity with another one.

import xlrd
from difflib import SequenceMatcher

workbook = xlrd.open_workbook('data.xlsx')
    def similar(a, b):
        return SequenceMatcher(None, a, b).ratio()

You might need to rephrase it. The example you found is not good enough because it shows similarity between two strings and you need similarity of each string with every other one? — postoronnim, Nov 29 '16 at 18:08
@user3718294 thanks for your message, Yes exactly . you can modify my question, I will definitely accept — nik, Nov 29 '16 at 18:09
@Bill Bell they are from an experiment. each of them shows a specific molecule — nik, Nov 29 '16 at 18:45
Then if you were to compare any two of these strings with their molecular similarity in mind how would you assign a number indicating their similarity? — Bill Bell, Nov 29 '16 at 18:53
@Bill Bell number of letter from sting A which is similar to string B divided by the number of unsimilar letter from A to B — nik, Nov 29 '16 at 18:55
Do I understand: the number of letters that A and B have in common divided by the number of letters that are in one string but not the other? What happens when the denominator is zero? — Bill Bell, Nov 29 '16 at 19:01
@Bill Bell Yes, when the result is 1, it means they are both identical. when they dont have any similarity , it will be 0 — nik, Nov 29 '16 at 19:17

score 1 · Answer 1 · answered Nov 29 '16 at 18:18

Since you are working on text data you can use sklearn.metrics.pairwise.cosine_similarity that Compute cosine similarity between samples in X and Y. You can also use sklearn.feature_extraction.text.{TfidfVectorizer, CountVectorizer} to convert a collection of raw documents to numerical vectors Here's some code:

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
text_data = load_function(...)
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(text_data)
similarities = cosine_similarity(X_train_counts)
percent_sim = similarities * 100

score 1 · Answer 2 · answered Nov 29 '16 at 18:27

What you are describing is called String Matching which is class of problems for which many solutions exist:

Levenshtein
Jaccard
Jaro
Affine Gap
NeedlemanWunch
Smith-Waterman
Cosine Similarity
Euclidian

Each of them have characteristics that suite certain problems. A lot of innovation in this field is fueled by genomics research where two sets of DNA sequences need to be compared. That list should give you a starting point to find a solution that matches your needs. But Levenshtein is the most common for most use cases.

Bill Bell · Accepted Answer · 2016-11-29T22:44:12.007

Taking two of the strings from your list as a sample, I offer this way of calculating a measure.

>>> from collections import Counter
>>> stringA = 'AAAAAGPLQPETENAGTSV'
>>> stringB = 'AAAAANNGAAPPDLSLMALAR'
>>> unionSize = len(stringA) + len(stringB)
>>> A=Counter(list(stringA))
>>> B=Counter(list(stringB))
>>> A
Counter({'A': 6, 'G': 2, 'E': 2, 'T': 2, 'P': 2, 'V': 1, 'Q': 1, 'S': 1, 'N': 1, 'L': 1})
>>> B
Counter({'A': 9, 'L': 3, 'N': 2, 'P': 2, 'G': 1, 'M': 1, 'S': 1, 'R': 1, 'D': 1})
>>> symDiff = set(A.keys()).symmetric_difference(set(B.keys()))
>>> symDiff
{'M', 'V', 'Q', 'E', 'T', 'D', 'R'}
>>> symDiffSize = 0
>>> for key in symDiff:
...     if key in A.keys():
...         symDiffSize += A[key]
...     else:
...         symDiffSize += B[key]
...         
>>> symDiffSize, unionSize
(9, 40)

If the two strings had all letters in common then there would be no letters in their 'symmetric difference', which would make the denominator zero. This would seem to mean that the more letters in common and the fewer that are unshared the greater the fraction. You could perhaps take its logarithm.

I don't have Excel. This code accepts a list of strings which you could glean from Excel. It avoids redundant calculations of multisets (aka bags) to save resources. Also, it returns a pair, rather than a ratio because sometimes the denominator can be zero.

from collections import Counter

strings = [
    'AAAAAAAAAAAAAADPVAGDHENIVSDSTQASR', 
    'AAAAAGPLQPETENAGTSV', 
    'AAAAANNGAAPPDLSLMALAR', 
    'AAAAASAVNDYYGTWGQK', 
    'AAAAASGASNTDSSATKPK', 
    'AAAAGFNWDDADVK', 
    'AAAAGFNWDDADVKK', 
    ]

class NikDistance():
    def __init__ (self, strings):
        self.stringLengths = [len(str) for str in strings]
        self.stringCounters = []
        for str in strings:
            self.stringCounters.append(Counter(list(str)))
    def __call__ (self, i, j):
        unionDiff = self.stringLengths[i] + self.stringLengths[j]
        symDiff = set(self.stringCounters[i].keys()).symmetric_difference(set(self.stringCounters[j].keys()))
        symDiffSize = 0
        for key in symDiff:
            if key in self.stringCounters[i].keys():
                symDiffSize += self.stringCounters[i][key]
            else:
                symDiffSize += self.stringCounters[j][key]
        return (symDiffSize, unionDiff)

nikDistance = NikDistance(strings)

for i in range(len(strings)):
    for j in range(i+1, len(strings)):
        print (strings[i], strings[j], nikDistance(i,j))

Result:

AAAAAAAAAAAAAADPVAGDHENIVSDSTQASR AAAAAGPLQPETENAGTSV (7, 52)
AAAAAAAAAAAAAADPVAGDHENIVSDSTQASR AAAAANNGAAPPDLSLMALAR (11, 54)
AAAAAAAAAAAAAADPVAGDHENIVSDSTQASR AAAAASAVNDYYGTWGQK (9, 51)
AAAAAAAAAAAAAADPVAGDHENIVSDSTQASR AAAAASGASNTDSSATKPK (9, 52)
AAAAAAAAAAAAAADPVAGDHENIVSDSTQASR AAAAGFNWDDADVK (13, 47)
AAAAAAAAAAAAAADPVAGDHENIVSDSTQASR AAAAGFNWDDADVKK (14, 48)
AAAAAGPLQPETENAGTSV AAAAANNGAAPPDLSLMALAR (9, 40)
AAAAAGPLQPETENAGTSV AAAAASAVNDYYGTWGQK (10, 37)
AAAAAGPLQPETENAGTSV AAAAASGASNTDSSATKPK (8, 38)
AAAAAGPLQPETENAGTSV AAAAGFNWDDADVK (15, 33)
AAAAAGPLQPETENAGTSV AAAAGFNWDDADVKK (16, 34)
AAAAANNGAAPPDLSLMALAR AAAAASAVNDYYGTWGQK (14, 39)
AAAAANNGAAPPDLSLMALAR AAAAASGASNTDSSATKPK (9, 40)
AAAAANNGAAPPDLSLMALAR AAAAGFNWDDADVK (12, 35)
AAAAANNGAAPPDLSLMALAR AAAAGFNWDDADVKK (13, 36)
AAAAASAVNDYYGTWGQK AAAAASGASNTDSSATKPK (6, 37)
AAAAASAVNDYYGTWGQK AAAAGFNWDDADVK (6, 32)
AAAAASAVNDYYGTWGQK AAAAGFNWDDADVKK (6, 33)
AAAAASGASNTDSSATKPK AAAAGFNWDDADVK (10, 33)
AAAAASGASNTDSSATKPK AAAAGFNWDDADVKK (10, 34)
AAAAGFNWDDADVK AAAAGFNWDDADVKK (0, 29)

Consider the last item. There are 29 characters altogether, and there are no (zero) characters that don't appear in both strings.

Look at the penultimate item. There are a total of 34 characters. Ten (10) of them do not appear in both strings.

I liked your answer already. However, the problem still remains because when I have so many strings, it is difficult to calculate, if you look at my question. I asked how I do it for as many strings as I have? if you modify your question. I would definitely accept it — nik, Nov 29 '16 at 19:35
it definitely does the trick, however, something is not clear, each time I get two integer values for example (0, 66) , what is 0 and what is 66? also would it be possible to show which string with which string results in which values ? for example A1 A2 : (0, 66) , A1 A3: (7, 52) etc — nik, Nov 29 '16 at 21:52
I changed the final line to show the strings that yielded the results. There's also an explanation of the results at the end. — Bill Bell, Nov 29 '16 at 22:44

score 0 · Answer 4 · answered Nov 29 '16 at 18:26

You need to use a nested loop, like this:

    L = ['NIVSDSTQASR', 'QPETENAGTSV', 'AAPPDLSLMALAR', 'AASAVNDYYGTWGQK']
    T = []
    for i in L:
        for j in L:
            if i != j:
                T.append((i+j))
    for k in T:
    print k

Just substitute "append" with your sequence matcher function.

how to find similarity between many strings and plot it

4 Answers4

Linked