String a = 'string'
String b = 'This is a strin'
println b.containsSimilarity(a)

Is there a function in Groovy like the imaginary containsSimilarity above that reports how much of one string is found in another? I want to search for "string" in "This is a strin" and have the comparison report that 83% of the string "string" is found in "This is a strin". It would be something like the assertion output in Spock:

assert "string"=="strin"

The result is:

"string"=="strin"
        |
        false
        1 difference (83% similarity)
        strin(g)
        strin(-)

How can I do this in Groovy? The goal is not to compare two strings for equality, but to find how large a part of String a is contained in String b. If a is fully contained in b, return true; otherwise return false, print the similarity percentage, and show where the difference is.

Xelian
  • Isn't this related to Levenshtein distance on string? Maybe [this stackoverflow question](http://stackoverflow.com/questions/6087281/similarity-score-levenshtein) has the answer – Will Mar 28 '14 at 09:53
  • Not at all. My string b is very long and I want to search for some small part of it. So if my string b is the English alphabet and String a is "w", after the comparison I want a result of 100%, but with Levenshtein or Jaro-Winkler it will be 0.03% or even less. – Xelian Mar 28 '14 at 10:06
  • Here is one way https://blog.nishtahir.com/2015/09/19/fuzzy-string-matching-using-cosine-similarity/ More relevant would be looking at https://stackoverflow.com/questions/955110/similarity-string-comparison-in-java – Mahesh Pujari Jan 18 '18 at 05:05

2 Answers

def s1 = "string", s2 = "This is a strin"
def l1 = s1.size(), l2 = s2.size()

if (l1 >= l2) {
    large = s1
    small = s2
} else {
    large = s2
    small = s1
}

def percent = 100 / small.size()

def match(large, str) {
    if (large.indexOf(str) == -1) {
        return match(large, str.substring(0, str.size() - 1))
    }
    return str.size()
}

println(Math.round(match(large, small) * percent))  // 83
  • Thanks for the answer, but with s1 = "very nice string" and s2 = "This is very strin" the result is 31%, when in fact 10 of the 16 letters match, approximately 63%. That is because your algorithm only trims from the end, so it matches only 'very ' (5 letters). The problem occurs whenever the matching parts are separated by something that does not match. – Xelian Mar 29 '14 at 06:40
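
One possible way to count matching parts that are separated by non-matching text, as the comment above describes, is the longest common subsequence (LCS). The sketch below is not the method from the answer; `lcsLength` and `containsSimilarity` are hypothetical helper names chosen for illustration, and the percentage is taken relative to the length of the searched string.

```groovy
// Hypothetical helper: length of the longest common subsequence of a and b,
// computed with the classic dynamic-programming table.
int lcsLength(String a, String b) {
    int[][] dp = new int[a.length() + 1][b.length() + 1]
    for (int i = 1; i <= a.length(); i++) {
        for (int j = 1; j <= b.length(); j++) {
            if (a.charAt(i - 1) == b.charAt(j - 1)) {
                dp[i][j] = dp[i - 1][j - 1] + 1
            } else {
                dp[i][j] = Math.max(dp[i - 1][j], dp[i][j - 1])
            }
        }
    }
    return dp[a.length()][b.length()]
}

// Hypothetical helper: percentage of needle's characters that occur,
// in order but not necessarily adjacent, somewhere in haystack.
int containsSimilarity(String needle, String haystack) {
    if (needle.isEmpty()) return 100
    return Math.round(100.0d * lcsLength(needle, haystack) / needle.length()) as int
}

println containsSimilarity("string", "This is a strin")
println containsSimilarity("very nice string", "This is very strin")
```

Because a subsequence allows arbitrary gaps, this measure can overstate the match when the searched text is long: scattered single letters count just as much as a contiguous run. It also uses O(n·m) memory, which may matter for very long strings.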

I dug through the Spock source code using 'similarity' as a keyword and soon found the EditDistance class. Spock uses that class to calculate string distance. It depends only on EditPathOperation, so it can be extracted easily.
If you want the pretty-printed version, look at EditPathRenderer. It depends on the TextUtil.escape method, but that is extractable too.

But note that, as Peter Niederwieser documented, these classes calculate the Levenshtein distance, and you noted that this isn't exactly what you need. The author is on SO too, so maybe he can add something valuable to my answer.
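
For reference, here is a minimal, self-contained sketch of a Levenshtein-based similarity in plain Groovy. It is not Spock's EditDistance code; `levenshtein` and `similarityPercent` are hypothetical names, and the percentage formula (one minus the distance divided by the longer length) is an assumption chosen because it yields 83% for the "string" / "strin" example in the question.

```groovy
// Hypothetical helper: Levenshtein distance, keeping only two rows of the table.
int levenshtein(String a, String b) {
    int[] prev = new int[b.length() + 1]
    for (int j = 0; j <= b.length(); j++) {
        prev[j] = j                      // turning "" into b[0..j) takes j insertions
    }
    for (int i = 1; i <= a.length(); i++) {
        int[] curr = new int[b.length() + 1]
        curr[0] = i                      // turning a[0..i) into "" takes i deletions
        for (int j = 1; j <= b.length(); j++) {
            int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1
            int viaEdit = Math.min(curr[j - 1] + 1, prev[j] + 1)
            curr[j] = Math.min(viaEdit, prev[j - 1] + cost)
        }
        prev = curr
    }
    return prev[b.length()]
}

// Hypothetical helper: similarity as a percentage of the longer string's length.
int similarityPercent(String a, String b) {
    int longest = Math.max(a.length(), b.length())
    if (longest == 0) return 100
    return Math.round(100.0d * (longest - levenshtein(a, b)) / longest) as int
}

println similarityPercent("string", "strin")
```

Under this formula, comparing "w" with the full lowercase alphabet gives only about 4%, which illustrates why a whole-string edit distance does not fit the "how much of a is inside b" requirement.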

Seagull