Using the Jaro-Winkler class from this answer, which favors strings with matching prefixes, we can compare each abbreviation component to the phrase words, taking the maximum match to allow for skipped words. That gives these extensions:
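    using System;
    using System.Linq;
    using System.Text.RegularExpressions;

    public static class StringExt {
        // Thin wrapper: returns the proximity (similarity) score from the linked JaroWinkler class.
        public static double JaroWinklerDistance(this string s1, string s2) => JaroWinkler.proximity(s1, s2);

        // Split at each space and before each uppercase letter.
        private static readonly Regex AbbrevSplitRE = new Regex(@" |(?=\p{Lu})", RegexOptions.Compiled);

        public static double AbbrevSimilarity(this string abbrev, string phrase) {
            var phraseWords = phrase.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
            if (phraseWords.Length == 0) return 0.0; // avoid dividing by zero below

            return AbbrevSplitRE.Split(abbrev)
                .Where(aw => !String.IsNullOrEmpty(aw))
                // Component i is scored against phrase words i..end, keeping the best match.
                .Zip(Enumerable.Range(0, phraseWords.Length),
                     (aw, pwp) => Enumerable.Range(pwp, phraseWords.Length - pwp)
                                            .Select(n => aw.JaroWinklerDistance(phraseWords[n]))
                                            .Max())
                .Sum() / phraseWords.Length;
        }
    }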
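The JaroWinkler class itself lives in the linked answer and is not reproduced here. If you just want something to compile against, the following is a minimal stand-in sketch of the standard Jaro-Winkler similarity. The constants (boost threshold 0.7, prefix of up to 4 characters, prefix scale 0.1) are my assumptions about the usual defaults, so scores may differ from the linked class and from any numbers quoted in this answer.

```csharp
using System;

// Stand-in for the linked answer's class; only the name and the
// proximity(s1, s2) signature are taken from the code above.
public static class JaroWinkler {
    private const double BoostThreshold = 0.7; // assumed default
    private const int MaxPrefix = 4;           // assumed default
    private const double PrefixScale = 0.1;    // assumed default

    public static double proximity(string s1, string s2) {
        if (s1.Length == 0 || s2.Length == 0)
            return s1.Length == s2.Length ? 1.0 : 0.0;

        // Characters match only if they are equal and at most `window` positions apart.
        int window = Math.Max(0, Math.Max(s1.Length, s2.Length) / 2 - 1);
        var matched1 = new bool[s1.Length];
        var matched2 = new bool[s2.Length];
        int matches = 0;
        for (int i = 0; i < s1.Length; i++) {
            int lo = Math.Max(0, i - window);
            int hi = Math.Min(s2.Length - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (matched2[j] || s1[i] != s2[j]) continue;
                matched1[i] = matched2[j] = true;
                matches++;
                break;
            }
        }
        if (matches == 0) return 0.0;

        // Count matched characters that appear in a different order in the two strings.
        int halfTranspositions = 0;
        for (int i = 0, j = 0; i < s1.Length; i++) {
            if (!matched1[i]) continue;
            while (!matched2[j]) j++;
            if (s1[i] != s2[j]) halfTranspositions++;
            j++;
        }

        double m = matches;
        double jaro = (m / s1.Length + m / s2.Length + (m - halfTranspositions / 2) / m) / 3.0;
        if (jaro <= BoostThreshold) return jaro;

        // Winkler boost: reward a shared prefix of up to MaxPrefix characters.
        int prefix = 0;
        int limit = Math.Min(MaxPrefix, Math.Min(s1.Length, s2.Length));
        while (prefix < limit && s1[prefix] == s2[prefix]) prefix++;
        return jaro + prefix * PrefixScale * (1.0 - jaro);
    }
}
```

The method name is lowercase only so that the `JaroWinkler.proximity(s1, s2)` call in the extension class resolves unchanged.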
Note: The regular expression splits the abbreviation into components at each space and before each capital letter.
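To see what that split produces, here is a small sketch that applies the same pattern and drops the empty pieces, as the extension method does. The class and method names (`AbbrevSplitDemo`, `Components`) are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class AbbrevSplitDemo {
    // Same pattern as above: a space, or the zero-width position before an uppercase letter.
    private static readonly Regex AbbrevSplitRE = new Regex(@" |(?=\p{Lu})", RegexOptions.Compiled);

    // Hypothetical helper: returns the non-empty components of an abbreviation.
    public static string[] Components(string abbrev) =>
        AbbrevSplitRE.Split(abbrev)
                     .Where(aw => !String.IsNullOrEmpty(aw))
                     .ToArray();
}
```

For example, `AbbrevSplitDemo.Components("Qta VC Sta Parnaiba")` yields the five components Qta, V, C, Sta, Parnaiba, so "VC" is treated as two one-letter components rather than one.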
Then, we can compare each abbreviation (in abbrevs) to the original phrase:
var ans = abbrevs.Select(Abbrev => new { Abbrev, Similarity = Abbrev.AbbrevSimilarity(phrase) });
For your example, I get this answer:
Abbrev | Similarity
Qta VC Sta Parnaiba | 0.65001322751322754
Q V C Sta Pba | 0.60371693121693126
4 VC Sta Parnaiba | 0.53890211640211649
I might add a weight for shorter abbreviations, depending on my final purpose.
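One way such a weight could look, purely as a sketch: scale the similarity up by the fraction of the phrase length that the abbreviation saves. The helper name `WeightForBrevity`, the `alpha` parameter, and the formula itself are my own assumptions, not something I have tuned against real data.

```csharp
using System;

public static class AbbrevWeighting {
    // Hypothetical helper: boosts a similarity score for shorter abbreviations.
    // alpha = 0 leaves the score unchanged; larger alpha rewards brevity more.
    public static double WeightForBrevity(double similarity, string abbrev, string phrase, double alpha = 0.25) {
        if (string.IsNullOrEmpty(phrase)) return similarity;

        // Fraction of the phrase length saved by the abbreviation, clamped to [0, 1].
        double saved = 1.0 - (double)abbrev.Length / phrase.Length;
        saved = Math.Max(0.0, Math.Min(1.0, saved));

        return similarity * (1.0 + alpha * saved);
    }
}
```

It would be applied after the similarity is computed, e.g. `AbbrevWeighting.WeightForBrevity(Abbrev.AbbrevSimilarity(phrase), Abbrev, phrase)`. Note that the result can exceed 1, which is fine for ranking but not if you need a normalized score.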