1

I'm trying to compare a full string with it's abbreviated version and return a score accordingly to its similarity.

Here's an example:

Quarta Vara Civel Santana de Parnaiba

And the possible abbreviations

Qta VC Sta Parnaiba

Q V C Sta Pba

4 VC Sta Parnaiba

I tried using FuzzyStrings dll to do so, but when it comes to severe abbreviation like the second and third examples it doesn't work well.

Any ideas on how to deal with this problem?

Community
  • 1
  • 1
  • 1
    Do you have a basic ruleset for exactly how you abbreviate? If so, hardcore that ruleset into a function, and give it input's through parameters. It can work with that ruleset to solve the abbreviation. – Frontear Jan 10 '19 at 18:07
  • I agree with @Frontear, there has to be some kind of standard. You can't program for every conceivable thing a user might do. – Terry Tyson Jan 10 '19 at 18:11
  • No there isn't a pattern/ruleset for abbreviation since the strings comes from documents where I can't control what the user fill in the fields. – Carlos Alexandre Verísimo Jan 10 '19 at 18:16
  • @TerryTyson it doesn't have to get every abbreviation, but at least it should work correctly for most cases – Carlos Alexandre Verísimo Jan 10 '19 at 18:22
  • This sounds like a perfect case for writing the unit tests first. Write the tests so that they only pass if the class works the way you want it to. The tests are the definition of how you want this to work. Then write the code so that the tests pass. – Scott Hannen Jan 10 '19 at 19:16
  • @ScottHannen I'm doing that right now, I posted this question trying to get the answer faster, if by the time I finish there's no asnwer I'll post the awsner here! – Carlos Alexandre Verísimo Jan 10 '19 at 19:32
  • The reason is that from the question there's no way anyone can know what is meant by "similar" or what the score should be. That makes it impossible for anyone to answer the question. A unit test says, "If I compare this string to this abbreviation, I expect to get this score." Hypothetically someone could read the tests, understand the expected behavior, and then write the code. When all the tests pass then the code works as expected. – Scott Hannen Jan 10 '19 at 20:20
  • You need a string similarity algorithm - [this one](https://stackoverflow.com/a/9454016/2557128) might work well, but there are lots of choices. One that emphasized per word prefix matches based on the [Jaro-Winkler distance](https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance) might be better. – NetMage Jan 10 '19 at 20:30

1 Answers1

4

Using the Jaro-Winkler distance class from this answer, which gives priority to matching prefixes, and comparing each abbreviation component to the phrase words (choosing the maximum match to compensate for skipping words), we can write these extensions:

public static class StringExt {
    public static double JaroWinklerDistance(this string s1, string s2) => JaroWinkler.proximity(s1, s2);

    private static Regex AbbrevSplitRE = new Regex(@" |(?=\p{Lu})", RegexOptions.Compiled);
    public static double AbbrevSimilarity(this string abbrev, string phrase) {
        var phraseWords = phrase.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        return AbbrevSplitRE.Split(abbrev)
                            .Where(aw => !String.IsNullOrEmpty(aw))
                            .Zip(Enumerable.Range(0, phraseWords.Length),
                                 (aw, pwp) => Enumerable.Range(pwp, phraseWords.Length-pwp).Select(n => aw.JaroWinklerDistance(phraseWords[n])).Max()
                            )
                            .Sum() / phraseWords.Length;
    }    
}

Note: The regular expression defines abbreviation components as at each space or capital letter.

Then, we can compare each abbreviation (in abbrevs) to the original phrase:

var ans = abbrevs.Select(Abbrev => new { Abbrev, Similarity = Abbrev.AbbrevSimilarity(phrase) });

For your example, I get this answer:

      Abbrev        |     Similarity
Qta VC Sta Parnaiba | 0.65001322751322754
Q V C Sta Pba       | 0.60371693121693126
4 VC Sta Parnaiba   | 0.53890211640211649

I might add a weight for shorter abbreviations, depending on my final purpose.

NetMage
  • 26,163
  • 3
  • 34
  • 55