fuzzy string compare (check for shorthand matching) C#

Question

I have two lists of strings, and from each list I want to extract the indexes of the strings that also appear in the other list. A string can match exactly, or it can be a shorthand of a string in the other list. For example, consider these two lists:

List<string> aList = new List<string> { "Id", "PartCode", "PartName", "EquipType" };
List<string> bList = new List<string> { "PartCode", "PartName", "PartShortName", "EquipmentType" };

In the above example, I want indexes 1, 2, 3 from aList

and indexes 0, 1, 3 from bList.

Indexes 1 and 2 from aList are obvious, since those strings match completely. The interesting part is "EquipType" and "EquipmentType", which match because "EquipType" is a shorthand of "EquipmentType".

But "PartName" is not a shorthand of "PartShortName", so that pair should not count as a match.

This is my code:

List<string> aList = new List<string> { "Id", "PartCode", "PartName", "EquipType" };// 1, 2 , 3
List<string> bList = new List<string> { "PartCode", "PartName", "PartShortName", "EquipmentType" };//0, 1 ,3 

List<int> alistIndex = new List<int>();
List<int> blistIndex = new List<int>();

for (int i = 0; i < aList.Count; i++)           
{
    string a = aList[i];
    for (int j = 0; j < bList.Count; j++)
    {
        string b = bList[j];

        string bigger, smaller;
        int biggerCount, smallerCount;
        if (a.Length > b.Length)
        {
            bigger = a; smaller = b;
            biggerCount = a.Length ; smallerCount = b.Length ;    
        }
        else
        {
            bigger = b; smaller = a;
            biggerCount = b.Length; smallerCount = a.Length ;
        }

        // Count how many characters of the shorter string appear, in order,
        // inside the longer one (a subsequence test).
        int countCheck = 0;
        for (int k = 0; k < biggerCount; k++)
        {
            if (smaller.Length != countCheck)
            {
                if (bigger[k] == smaller[countCheck])
                    countCheck++;
            }
        }

        if (countCheck == smaller.Length)
        {
            alistIndex.Add(i);
            blistIndex.Add(j);
            break; // stop at the first b that matches this a
        }
    }
}

alistIndex.ForEach(i => Console.Write(i));
Console.WriteLine(Environment.NewLine);
blistIndex.ForEach(i => Console.Write(i));
Console.ReadKey();

The above code works just fine and looks very similar to this solution.

But if I change the order of the second list like so:

 List<string> bList = new List<string> { "PartCode", "PartShortName", "PartName", "EquipmentType" };

I get indexes 0, 1 and 3 (but I want 0, 2 and 3).

Should I check the distance for every pair and return the lowest, or should I use a different method?

Thanks

P.S. I also found this GitHub, but I don't know if it will do the trick for me.

Please double-check your examples and indexes; I could swear you're contradicting yourself. Also be clear about what you mean: you cannot have shorthand matching work in one case and not in the other, as per @L_J's question. He's trying to get clarity on what you want, as it doesn't appear to be clear. - Seabizkit, Jun 03 '18 at 07:53
@Seabizkit Maybe "shorthand" is not the right word to describe what I want; "abbreviation" might be more suitable (like U.S. and United States, so EquipType and EquipmentType match, but PartName and PartShortName do not). Am I clearer now? - styx, Jun 03 '18 at 07:59
Question is really about how to compare two strings. After that, comparing of lists is trivial. So, what are the exact criteria when you consider two strings being the "same"? - L_J, Jun 03 '18 at 08:02
I feel that matching for example *Id* with *Idiotic* or *Foo* with *Foolish* would be quite strange :-) But it is what you are trying to do. Especially with short words like Id, there can be many matches that have totally different meanings while starting in the same way... It reminds me of people who tried to use regexes to match "bad" words, then discovered that *ass* was part of amb*ass*ador, c*ass*ette, *ass*umption... - xanatos, Jun 03 '18 at 08:49

xanatos · Answer 1 · 2018-06-03T10:43:44.107

I do feel that what you are trying to do is a bad idea... Id is the abbreviation of Idiotic, just to give an example :-) Still... I wanted to do some experiments on Unicode.

Now, this code splits words on uppercase letters: PartName is Part + Name because the N is uppercase. It doesn't support ID as Identifier (that would have to be IDentifier), but it does support NSA as NotSuchAgency :-) So full acronyms are OK, while FDA isn't equivalent to FoodAndDrugAdministration, because the A of And has no counterpart in FDA; acronyms that skip conjunctions don't match.

// Requires: using System; using System.Globalization;
public static bool ShorthandCompare(string str1, string str2)
{
    if (str1 == null)
    {
        throw new ArgumentNullException(nameof(str1));
    }

    if (str2 == null)
    {
        throw new ArgumentNullException(nameof(str2));
    }

    if (str1 == string.Empty)
    {
        return str2 == string.Empty;
    }

    if (object.ReferenceEquals(str1, str2))
    {
        return true;
    }

    var ee1 = StringInfo.GetTextElementEnumerator(str1);
    var ee2 = StringInfo.GetTextElementEnumerator(str2);

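    // Despite the "eos" names, these hold the last MoveNext() result:
    // true means an element is available, false means the enumerator is exhausted.
    bool eos1, eos2 = true;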

    while ((eos1 = ee1.MoveNext()) && (eos2 = ee2.MoveNext()))
    {
        string ch1 = ee1.GetTextElement(), ch2 = ee2.GetTextElement();

        // The string.Compare does some nifty tricks with unicode
        // like string.Compare("ì", "i\u0300") == 0
        if (string.Compare(ch1, ch2) == 0)
        {
            continue;
        }

        UnicodeCategory uc1 = char.GetUnicodeCategory(ch1, 0);
        UnicodeCategory uc2 = char.GetUnicodeCategory(ch2, 0);

        if (uc1 == UnicodeCategory.UppercaseLetter)
        {
            while (uc2 != UnicodeCategory.UppercaseLetter && (eos2 = ee2.MoveNext()))
            {
                ch2 = ee2.GetTextElement();
                uc2 = char.GetUnicodeCategory(ch2, 0);
            }

            if (!eos2 || string.Compare(ch1, ch2) != 0)
            {
                return false;
            }

            continue;
        }
        else if (uc2 == UnicodeCategory.UppercaseLetter)
        {
            while (uc1 != UnicodeCategory.UppercaseLetter && (eos1 = ee1.MoveNext()))
            {
                ch1 = ee1.GetTextElement();
                uc1 = char.GetUnicodeCategory(ch1, 0);
            }

            if (!eos1 || string.Compare(ch1, ch2) != 0)
            {
                return false;
            }

            continue;
        }

        // We already know they are different!
        return false;
    }

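    if (eos1)
    {
        // ee1 has already advanced to an element that was never compared
        // (ee2 ran out first), so check the current element before moving on.
        do
        {
            string ch1 = ee1.GetTextElement();
            UnicodeCategory uc1 = char.GetUnicodeCategory(ch1, 0);

            if (uc1 == UnicodeCategory.UppercaseLetter)
            {
                return false;
            }
        } while (ee1.MoveNext());
    }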
    else if (eos2)
    {
        while (ee2.MoveNext())
        {
            string ch2 = ee2.GetTextElement();
            UnicodeCategory uc2 = char.GetUnicodeCategory(ch2, 0);

            if (uc2 == UnicodeCategory.UppercaseLetter)
            {
                return false;
            }
        }
    }

    return true;
}

and then

List<string> aList = new List<string> { "Id", "PartCode", "PartName", "EquipType" };
List<string> bList = new List<string> { "PartCode", "PartName", "PartShortName", "EquipmentType" };

List<List<int>> matches = new List<List<int>>();

for (int i = 0; i < aList.Count; i++)
{
    var lst = new List<int>();
    matches.Add(lst);

    for (int j = 0; j < bList.Count; j++)
    {
        if (ShorthandCompare(aList[i], bList[j]))
        {
            lst.Add(j);
        }
    }
}

Note that the result is a List<List<int>>, because you could have multiple matches for a single word of aList!
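If only the two flat index lists from the question are needed, they can be derived in the same double loop. The following is a minimal sketch, not part of the original answer: MatchIndexes and Collect are hypothetical names, and the isMatch parameter stands in for ShorthandCompare so that the snippet compiles on its own (the demo in Main uses plain equality instead).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class MatchIndexes
{
    // Hypothetical helper: returns the indexes of aList that match at least
    // one entry of bList, and the indexes of bList matched by at least one
    // entry of aList. isMatch is an assumed stand-in for ShorthandCompare.
    public static (List<int> A, List<int> B) Collect(
        IReadOnlyList<string> aList,
        IReadOnlyList<string> bList,
        Func<string, string, bool> isMatch)
    {
        var aIndexes = new List<int>();
        var bIndexes = new SortedSet<int>(); // sorted, no duplicates

        for (int i = 0; i < aList.Count; i++)
        {
            bool any = false;

            for (int j = 0; j < bList.Count; j++)
            {
                if (isMatch(aList[i], bList[j]))
                {
                    any = true;
                    bIndexes.Add(j);
                }
            }

            if (any)
            {
                aIndexes.Add(i);
            }
        }

        return (aIndexes, bIndexes.ToList());
    }

    public static void Main()
    {
        var aList = new List<string> { "Id", "PartCode", "PartName", "EquipType" };
        var bList = new List<string> { "PartCode", "PartShortName", "PartName", "EquipmentType" };

        // Exact equality only, so EquipType/EquipmentType is not matched here.
        var result = Collect(aList, bList, (x, y) => x == y);

        Console.WriteLine(string.Join(",", result.A)); // 1,2
        Console.WriteLine(string.Join(",", result.B)); // 0,2
    }
}
```

Passing ShorthandCompare instead of the equality lambda should, tracing the reordered bList by hand, give 1,2,3 and 0,2,3, which is what the question asks for. Unlike the question's loop, there is no break after the first hit, so the order of bList does not affect the result.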

Now... the interesting part of ShorthandCompare is that it tries to be "intelligent": it handles non-BMP Unicode characters (through StringInfo.GetTextElementEnumerator) and decomposed Unicode characters (the ì character can also be written in Unicode as i + \u0300, the combining grave accent). It does this through string.Compare, which performs a culture-sensitive linguistic comparison, unlike the default string.Equals, which compares ordinally (string.CompareOrdinal behaves like string.Equals in that respect).

bool cmp1 = ShorthandCompare("IdìoLe\u0300ss", "Idi\u0300oticLèsser"); // true
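To make the difference between the comparison modes concrete, here is a small standalone sketch. UnicodeCompareDemo and its method names are made up for illustration, and it uses StringComparison.InvariantCulture where the code above uses the current culture, which is an assumption made to keep the demo independent of the machine's locale.

```csharp
using System;
using System.Globalization;

public static class UnicodeCompareDemo
{
    // Ordinal comparison: compares the UTF-16 code units one by one.
    public static bool OrdinalEqual(string a, string b) =>
        string.Equals(a, b, StringComparison.Ordinal);

    // Linguistic comparison under the invariant culture (assumption: the
    // answer's code relies on the current culture instead).
    public static bool LinguisticEqual(string a, string b) =>
        string.Compare(a, b, StringComparison.InvariantCulture) == 0;

    // Number of text elements (roughly, user-perceived characters).
    public static int TextElementCount(string s) =>
        new StringInfo(s).LengthInTextElements;

    public static void Main()
    {
        string precomposed = "\u00EC"; // ì as a single code point
        string decomposed = "i\u0300"; // i followed by a combining grave accent
        string nonBmp = "\U0001F600";  // one code point outside the BMP

        Console.WriteLine(OrdinalEqual(precomposed, decomposed)); // False
        // Usually True; may differ if the runtime runs in invariant globalization mode.
        Console.WriteLine(LinguisticEqual(precomposed, decomposed));

        Console.WriteLine(decomposed.Length);            // 2 (UTF-16 code units)
        Console.WriteLine(TextElementCount(decomposed)); // 1
        Console.WriteLine(nonBmp.Length);                // 2 (surrogate pair)
        Console.WriteLine(TextElementCount(nonBmp));     // 1
    }
}
```

This is why ShorthandCompare walks text elements rather than chars: a base letter plus its combining mark, or a surrogate pair, is handed to string.Compare as one unit.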
