3

There are many ways to compare two strings to find the first index where they differ, but if I require case-insensitivity in any given culture, then the options go away.

This is the only way I can think to do such a comparison:

static int FirstDiff(string str1, string str2)
{
    for (int i = 1; i <= str1.Length && i <= str2.Length; i++)
        if (!string.Equals(str1.Substring(0, i), str2.Substring(0, i), StringComparison.CurrentCultureIgnoreCase))
            return i - 1;
    return -1; // strings are identical
}

Can anyone think of a better way that doesn't involve so much string allocation?

For testing purposes:

// Turkish word 'open' contains the letter 'ı' which is the lowercase of 'I' in Turkish, but not English
string lowerCase = "açık";
string upperCase = "AÇIK";

Thread.CurrentThread.CurrentCulture = new CultureInfo("en-US");
FirstDiff(lowerCase, upperCase); // Should return 2

Thread.CurrentThread.CurrentCulture = new CultureInfo("tr-TR");
FirstDiff(lowerCase, upperCase); // Should return -1

Edit: Checking both ToUpper and ToLower for each character appears to work for any example that I can come up with. But will it work for all cultures? Perhaps this is a question better directed at linguists.

Rhaokiel
  • 813
  • 6
  • 17
  • 1
    Are you doing this a *lot*? Is this a performance critical app? If not, why are you bothered about string allocations? – stuartd Aug 27 '19 at 20:48
  • A simple way to amortize the string allocations is to uppercase them first before comparing. – Hans Passant Aug 27 '19 at 20:50
  • 1
    A general way to speed up your code would be to use a binary search instead of linear search for choosing substrings. If the strings are equal, you've cut your comparisons in half. Continue this process of getting the middle index of those remaining, and your comparisons are reduced exponentially. (i.e. to compare the strings `"FinalCountdown"` and `"FinalCountdoon`", you would compare `"FinalCo"` first, then `"FinalCountd"`, then `"FinalCountdow"` and finally `"FinalCountdo"`. So you'd go from `13` comparisons down to only `4` comparisons. – Rufus L Aug 27 '19 at 21:03
  • @RufusL still, substring comparison requires allocating an additional string AND has O(n) complexity. – orhtej2 Aug 27 '19 at 21:07
  • 1
    After searching the net some more I came across [Unicode case folding rules](https://www.unicode.org/versions/Unicode9.0.0/ch03.pdf#G33992) that would address the issue. Judging from [NStack's implementation](https://github.com/migueldeicaza/NStack/blob/3fc024fb2c2e99927d3e12991570fb54db8ce01e/NStack/strings/ustring.cs#L784) the problem is more complicated, and relies on [folding Unicode characters](https://github.com/migueldeicaza/NStack/blob/3fc024fb2c2e99927d3e12991570fb54db8ce01e/NStack/unicode/Letter.cs#L492) rather than simple upper/lowercasing. – orhtej2 Aug 28 '19 at 20:29
  • @orhtej2, Somebody needs to get you a comment of the day award. That Unicode case folding document is quite relevant and well worth reading for any programmer dealing with string interpretation. – Rhaokiel Aug 29 '19 at 12:50

5 Answers5

2

One way to reduce the number of string allocations is to reduce the number of times you do a comparison. We can borrow from the binary search algorithm for searching arrays in this case, and start out by comparing a substring that is half the length of the string. Then we continue to add or remove half of the remaining indexes (depending on whether or not the strings were equal), until we get to the first instance of inequality.

In general this should speed up the search time:

public static int FirstDiffBinarySearch(string str1, string str2)
{
    // "Fail fast" checks
    if (string.Equals(str1, str2, StringComparison.CurrentCultureIgnoreCase))
        return -1;
    if (str1 == null || str2 == null) return 0;

    int min = 0;
    int max = Math.Min(str1.Length, str2.Length);
    int mid = (min + max) / 2;

    while (min <= max)
    {               
        if (string.Equals(str1.Substring(0, mid), str2.Substring(0, mid), 
            StringComparison.CurrentCultureIgnoreCase))
        {
            min = mid + 1;                    
        }
        else
        {
            max = mid - 1;
        }

        mid = (min + max) / 2;
    }

    return mid;
}
Rufus L
  • 36,127
  • 5
  • 30
  • 43
1

You need to check both ToLower and ToUpper.

private static int FirstDiff(string str1, string str2)
{
    int length = Math.Min(str1.Length, str2.Length);
    TextInfo textInfo = CultureInfo.CurrentCulture.TextInfo;
    for (int i = 0; i < length; ++i)
    {
        if (textInfo.ToUpper(str1[i]) != textInfo.ToUpper(str2[i]) ||
            textInfo.ToLower(str1[i]) != textInfo.ToLower(str2[i]))
        {
            return i;
        }
    }
    return str1.Length == str2.Length ? -1 : length;
}
ClickRick
  • 1,553
  • 2
  • 17
  • 37
  • 1
    Will this be true for any culture? I'm not well versed in languages besides English and Spanish. For example, are there cultures where there are multiple case bindings (more than 2) for a given letter? – Rhaokiel Aug 28 '19 at 12:45
1

You could compare characters instead of strings. This is far from optimized, and rather quick and dirty, but something like this appears to work

for (int i = 0; i < str1.Length && i < str2.Length; i++)
   if (char.ToLower(str1[i]) != char.ToLower(str2[i]))
       return i;

This should work with culture as well, according to the docs: https://learn.microsoft.com/en-us/dotnet/api/system.char.tolower?view=netframework-4.8

Casing rules are obtained from the current culture.

To convert a character to lowercase by using the casing conventions of the current culture, call the ToLower(Char, CultureInfo) method overload with a value of CurrentCulture for its culture parameter.

Community
  • 1
  • 1
entropic
  • 1,683
  • 14
  • 27
1

I am reminded of one additional oddity of characters (or rather unicode code points): there are some that act as surrogate pairs and they have no relevance to any culture unless the pair appears next to one another. For more information about Unicode interpretation standards see the document that @orhtej2 linked in his comment.

While trying out different solutions I stumbled upon this particular class, and I think it will best suit my needs: System.Globalization.StringInfo (The MS Doc Example shows it in action with surrogate pairs)

The class breaks the string down into sections by pieces that need each other to make sense (rather than by strictly character). I can then compare each piece by culture using string.Equals and return the index of the first piece that differs:

static int FirstDiff(string str1, string str2)
{
    var si1 = StringInfo.GetTextElementEnumerator(str1);
    var si2 = StringInfo.GetTextElementEnumerator(str2);

    bool more1, more2;
    while ((more1 = si1.MoveNext()) & (more2 = si2.MoveNext())) // single & to avoid short circuiting the right counterpart
        if (!string.Equals(si1.Current as string, si2.Current as string, StringComparison.CurrentCultureIgnoreCase))
            return si1.ElementIndex;

    if (more1 || more2)
        return si1.ElementIndex;
    else
        return -1; // strings are equivalent
}
Rhaokiel
  • 813
  • 6
  • 17
0

Here's a little different approach. Strings are technically arrays of char, so I'm using that along with LINQ.

var list1 = "Hellow".ToLower().ToList();
var list2 = "HeLio".ToLower().ToList();

var diffIndex = list1.Zip(list2, (item1, item2) => item1 == item2)
                .Select((match, index) => new { Match = match, Index = index })
                .Where(a => !a.Match)
                .Select(a => a.Index).FirstOrDefault();

If they match, diffIndex will be zero. Otherwise it will be the index of the first mismatching character.

Edit:

A little improved version with casting to lower case on the go. And initial ToList() was really redundant.

var diffIndex = list1.Zip(list2, (item1, item2) => char.ToLower(item1) == char.ToLower(item2))
                .Select((match, index) => new { Match = match, Index = index })
                .Where(a => !a.Match)
                .Select(a => a.Index).FirstOrDefault();

Edit2:

Here's a working version where it can be further shortened. This is the best answer since in the previous two you'll get 0 if strings match. Here' if strings match you get null and the index otherwise.

var list1 = "Hellow";
var list2 = "HeLio";

var diffIndex = list1.Zip(list2, (item1, item2) => char.ToLower(item1) == char.ToLower(item2))
                .Select((match, index) => new { Match = match, Index = index })
                .FirstOrDefault(x => !x.Match)?.Index;
Sach
  • 10,091
  • 8
  • 47
  • 84
  • The `ToList()` call is unnecessary. Also, you can mix and match this approach with lowercasing on the go. My complaint would be it creates additional instances of an anonymous type. – orhtej2 Aug 27 '19 at 21:06
  • Oh year agree on both counts. And I'm not even sure if this is the fastest approach, I think other answers could probably be better. I thought this'd be a little fun approach, a different one. – Sach Aug 27 '19 at 21:08
  • I would assume this only calls each lambda till the first difference is found so no real performance penalty would be involved? Unless `yield return` is that heavy? – orhtej2 Aug 27 '19 at 21:13