1

I have two strings say:

s_1 = "This is a bat"
s_2 = "This is a bag"

Qualitatively, they are either similar (1) or not (0); in the case above they are not similar because of the "g". Quantitatively, though, I can see that there is only a small amount of dissimilarity. How can I calculate this dissimilarity, the one letter "g", between s_1 and s_2 using Python?

I wrote one simple line of code:

Per_deff = 100.0 * Number_of_mutated_sites / len(s_1)  # 100.0 forces float division

This gives the percentage difference ("per_deff") between two strings of identical length. What if they are not of identical length? How can I solve my problem?
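For reference, the equal-length calculation above could be sketched as follows. This assumes Number_of_mutated_sites means the count of positions at which the two strings differ; the function name percent_difference is only illustrative.

```python
def percent_difference(a, b):
    """Percentage of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        # This simple measure is only defined for strings of the same length.
        raise ValueError("strings must have the same length")
    if not a:
        return 0.0
    # Count the positions where the characters do not match.
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return 100.0 * mismatches / len(a)


s_1 = "This is a bat"
s_2 = "This is a bag"
print(percent_difference(s_1, s_2))  # 1 mismatch out of 13 characters
```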

user3218971

4 Answers

5

What you want is something like the Levenshtein distance. It gives you the distance between two strings even if their lengths are not equal.

If the two strings are exactly the same, the distance is 0; the more similar they are, the smaller the distance.

Sample Code from Wikipedia:

// len_s and len_t are the number of characters in string s and t respectively
int LevenshteinDistance(string s, int len_s, string t, int len_t)
{ int cost;

  /* base case: empty strings */
  if (len_s == 0) return len_t;
  if (len_t == 0) return len_s;

  /* test if last characters of the strings match */
  if (s[len_s-1] == t[len_t-1])
      cost = 0;
  else
      cost = 1;

  /* return minimum of delete char from s, delete char from t, and delete char from both */
  return minimum(LevenshteinDistance(s, len_s - 1, t, len_t    ) + 1,
                 LevenshteinDistance(s, len_s    , t, len_t - 1) + 1,
                 LevenshteinDistance(s, len_s - 1, t, len_t - 1) + cost);
}
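Since the question asks for Python, here is a minimal sketch of the same distance in Python. It is not the Wikipedia code: it uses an iterative row-by-row table instead of plain recursion, which avoids recomputing the same subproblems. The helper percent_dissimilarity and its choice to divide by the longer string's length are assumptions for illustration, one of several possible ways to normalize.

```python
def levenshtein(s, t):
    """Levenshtein distance between s and t (insert, delete, substitute)."""
    if len(s) < len(t):
        s, t = t, s  # keep the row over the shorter string
    # previous[j] is the distance between the processed prefix of s and t[:j].
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]  # distance from s[:i] to the empty string
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(previous[j] + 1,          # delete from s
                               current[j - 1] + 1,       # insert into s
                               previous[j - 1] + cost))  # match or substitute
        previous = current
    return previous[-1]


def percent_dissimilarity(s, t):
    """Distance as a percentage of the longer string's length (assumed normalization)."""
    longest = max(len(s), len(t))
    if longest == 0:
        return 0.0
    return 100.0 * levenshtein(s, t) / longest


print(levenshtein("This is a bat", "This is a bag"))
print(percent_dissimilarity("This is a bat", "This is a bags"))
```

For equal-length strings that differ only by substitutions, this percentage matches the per-position formula from the question.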
Alex Riley
Pratik Gujarathi
2

You can use difflib from the Python standard library:

from difflib import SequenceMatcher


s_1 = "This is a bat"
s_2 = "This is a bag"
matcher = SequenceMatcher()
matcher.set_seqs(s_1, s_2)
print(matcher.ratio())  # similarity in [0, 1]; 1.0 means identical
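To see which characters differ rather than just a score, SequenceMatcher also offers get_opcodes(). The sketch below is one possible way to use it; the function names describe_differences and ratio_dissimilarity are made up for illustration, and turning the ratio into a percentage via 1 - ratio is an assumption, not something difflib defines.

```python
from difflib import SequenceMatcher


def describe_differences(a, b):
    """List the non-matching pieces as (operation, piece_of_a, piece_of_b)."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # tag is 'replace', 'delete' or 'insert'
            changes.append((tag, a[i1:i2], b[j1:j2]))
    return changes


def ratio_dissimilarity(a, b):
    """Percentage dissimilarity derived from ratio() (assumed: 100 * (1 - ratio))."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return 100.0 * (1.0 - matcher.ratio())


s_1 = "This is a bat"
s_2 = "This is a bag"
print(describe_differences(s_1, s_2))
print(ratio_dissimilarity(s_1, s_2))
```

This works for strings of different lengths as well, because ratio() is computed from the matched characters relative to the combined length of both strings.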
Dima Kudosh
1

What you are looking for is called edit distance.

https://pypi.python.org/pypi/editdistance

Edit distance is the minimum number of single-character edits needed to turn one string into the other.

There is also a quick implementation here:

https://stackoverflow.com/a/24172422/4044442

Jenner Felton
1

If I understand you correctly, you want to do fuzzy string matching. Multiple Python libraries exist for this; one of them is fuzzywuzzy.

from fuzzywuzzy import fuzz
s_1 = "This is a bat"
s_2 = "This is a bag"
fuzz.ratio(s_1, s_2)  # returns 92
fuzz.ratio(s_1, s_1)  # returns 100 (max score)
Łukasz Rogalski