3

Why does String.IndexOf(String, StringComparison) require a StringComparison and not allow for the more general StringComparer, or even just IComparer<T> or IEqualityComparer<T>?

I made a custom StringComparer to use with several dictionaries, and I want to use it in other parts of my project but I can't find a good way to do that without making lots of extensions methods, if those would even work.

This is the comparer I made. It was based roughly on this recommendation: Implementing custom IComparer with string

Also note that ModifyString is a WIP. I expect to add more things there, based on the input that I'm comparing against. I also know that it's expensive, but I'm just looking for a solution ATM, not performance.

public class CustomComparer : StringComparer
{
    public override int Compare(string x, string y)
    {
        return StringComparer.Ordinal.Compare(ModifyString(x), ModifyString(y));
    }

    public override bool Equals(string x, string y)
    {
        if (ModifyString(x).Equals(ModifyString(y)))
            return true;
        else
            return false;
    }

    public override int GetHashCode(string obj)
    {
        if (obj == null)
            return 0;
        else
            return ModifyString(obj).GetHashCode();
    }

    private string ModifyString(string s)
    {
        //I know this code is expensive/naaive, your suggestions are welcome.
        s = s.ToLowerInvariant();
        s = s.Trim();
        s = Regex.Replace(s, @"\s+", " ");//replaces all whitespace characters with a single space.
        return s;
    }
}
Isaiah Shiner
  • 443
  • 6
  • 18
  • 2
    Examples are always helpful. – Dan Wilson May 14 '18 at 20:11
  • 1
    How would you search with an entirely custom comparer? You do not know anything about lengths of substrings or anything. For example, to search for `"x"` within `"horse"`, would you compare `"x"` to all sub-words of all lengths? – Jeppe Stig Nielsen May 14 '18 at 20:12
  • I'd have to assume that the issue with having a standard implementation is that it may assume things that your StringComparer may not. Optimizations to exit early with -1, such as one input being null and the other not, may not be valid when presented with the `NullMatchesWhenTheStartStringBeginsWithTwoZeroesStringComparer` implementation. It goes more confusing if you have a `StringComparer` that reduces repeating character sequences, like 'aaabb' to 'ab' for the purpose of finding the index match. I'm not sure how you'd go about confirming you were comparing the correct sub string. – Jonathon Chase May 14 '18 at 20:24
  • `"horse".IndexOf("x", …)` [You can try this](https://tio.run/##dVLPb4IwFL7zV7yQJUJEMq9DdyHGmWi2xMPOtTykSW1dWxjE8Le7IsVhlt3a974f730t1TOq6fVaaiaOsG@0wVPijW9xKjlHapgUOl6jQMVo4nnaEMMoUE60hg8lj4qcvMtQriTLYEeYCEJb9AupNPrxRmRYv@eBX/sRCPyGFS1kKk9nolAFYZh4ref1iuMOvMDqqyScmWYoLbRRdsRXK34uD9w6ygqVYhnCQUre43XQo6COwJ0aOw6kdhPJMf5UzOCWCQyefCcMk0vdToCIzJ6adhL7YQIKTakE5FYRE2j/ODJhYI3mjegilRneXTsvx33ueHY5R3UpdcQhE1MwPUypZako3oeuCC/tbfNfCEBdocs6lwqCiqjOQ5mbOiw7/9F94RziLYqjKca96fRBhN8AVmBHTBHv7HvOoxHahuMQi@WjJsxGsAHVi7McgmHieHionrsvD/1OwS85cuQw6oMIR7GOLbp8W881ZvPbX2qv1x8). C# source is encoded in URL. – Jeppe Stig Nielsen May 14 '18 at 20:55
  • @DanWilson Added example code. – Isaiah Shiner May 14 '18 at 21:06

2 Answers2

2

Using a convenient extension for IEnumerable it seems like it should have already, you can write a String extension to use a StringComparer. As suggested in a comment, all possible substring lengths are tested at each position since no assumption about the custom StringComparer can be made.

public static class IEnumerableExt {
    public static T FirstOrDefault<T>(this IEnumerable<T> src, Func<T, bool> testFn, T defval) => src.Where(aT => testFn(aT)).DefaultIfEmpty(defval).First();
}

public static class StringExt {
    public static int IndexOf(this string source, string match, StringComparer sc) {
        return Enumerable.Range(0, source.Length) // for each position in the string
                         .FirstOrDefault(i => // find the first position where either
                             // match is Equal at this position for length of match (or to end of string) or
                             sc.Equals(source.Substring(i, Math.Min(match.Length, source.Length-i)), match) ||
                             // match is Equal to one of the substrings beginning at this position
                             Enumerable.Range(1, source.Length-i).Any(ml => sc.Equals(source.Substring(i, ml), match)),
                             -1 // else return -1 if no position matches
                          );
    }
}

Note: Modified to properly handle when the source substring and match string lengths may not be equal.

NetMage
  • 26,163
  • 3
  • 34
  • 55
  • Works perfectly. I don't exactly follow how it works, LINQ is not my strong suit, but I'll credit you in the comments, and maybe figure it out at some point. Thank you! – Isaiah Shiner May 15 '18 at 14:46
  • @IsaiahShiner I added some comments to the code to attempt to help. – NetMage May 15 '18 at 19:16
  • Are there any unit tests written for this code? – Tom Bogle Sep 14 '21 at 14:48
  • I wrote some unit tests for this using a comparar that can consider two strings "equal" even if they differ in length. In this scenario, this solution does not work because the Math.Min considers the unadjusted string lengths, and in my comparer a longer string can actually be regarded as a substring of a shorter one. (A potentially practical example of this would be a comparer that regarded the German Fluss (ß) to be equal to "ss". Then IndexOf("Fluß", "Fluss", myFlussComparer) should return 0. – Tom Bogle Sep 14 '21 at 19:59
  • @TomBogle I modified the method to work in that case, I believe. – NetMage Sep 15 '21 at 19:50
  • Thanks! Good Work! That makes my tests pass. – Tom Bogle Sep 16 '21 at 20:31
  • After a bit more consideration, I can see that there are still potential problem cases. If a comparer treats two characters as a dipthong, then a potential substring that splits the dipthong is judged to be a substring by the proposed algorithm, but maybe it shouldn't be. I think that in order to solve this, all characters from the source string that precede the index position being tested would have to be prepended to the match string and the comparison would have to be done starting from the beginning of the source string. It might end up being more efficient. I might take a crack at it... – Tom Bogle Sep 17 '21 at 15:47
  • After even more consideration, I think I'm trying to solve a problem which contains a logical contradiction and is therefore unsolvable. Seen from one perspective "his" is definitely a substring of "This" no matter what your string comparison algorithm is (by virtue of it being character-for-character ordinally identical). Seen from the other angle, if "Th" is an (indivisible) diphthong, then there is no way that "his" could be a substring since "Th" and "h" are not the same. – Tom Bogle Sep 17 '21 at 17:42
  • @TomBogle The OP sample `StringComparer` uses a `ModifyString` routine before comparison - if that routine converted `Th` into a single diphthong character, `his` would not match. So if you canonicalize both strings before comparison, that should handle it? – NetMage Sep 17 '21 at 18:13
  • Yes, fair point. Unfortunately, all I have in my situation is an IComparer, so that isn't an option for me. Of course, in reality, in order for the result of IndexOf to be meaningful, you would have to canonicalize the two strings BEFORE calling IndexOf, and then if you needed to know how that index related to some (potentially ambiguous or non-existent) position in the original strings, you're still kind of stuck. – Tom Bogle Sep 20 '21 at 12:06
  • @TomBogle Thinking more, it seems like your hypothetical diphthong `StringComparer` isn't possible - it doesn't have the context to determine if the `h` in `his` is preceded by a `T`. IOTW, the `StringComparer` and the `IndexOf` aren't compatible in that case (e.g. `StringComparer.Equals(s.Substring(1, 3), 'his') == true` but `s.IndexOf('his') == -1`). – NetMage Sep 20 '21 at 21:06
1

In case anyone needs a version of it that works with an IComparer<String>, here's a trivial modification of the good solution posted by @NetMage:

        public static int IndexOf(this string source, string match, IComparer<String> sc) {
        return Enumerable.Range(0, source.Length) // for each position in the string
            .FirstOrDefault(i => // find the first position where either
                    // match is Equal at this position for length of match (or to end of string) or
                    sc.Compare(source.Substring(i, Math.Min(match.Length, source.Length-i)), match) == 0 ||
                    // match is Equal to one of the substrings beginning at this position
                    Enumerable.Range(1, source.Length-i).Any(ml => sc.Compare(source.Substring(i, ml), match) == 0),
                -1 // else return -1 if no position matches
            );
    }
Tom Bogle
  • 464
  • 4
  • 18