I have a custom IComparer<string> that I use to compare strings while ignoring their case and symbols, like this:

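using System.Collections.Generic;
using System.Globalization;

public class LiberalStringComparer : IComparer<string>
{
    private readonly CompareInfo _compareInfo = CultureInfo.InvariantCulture.CompareInfo;
    // OrdinalIgnoreCase must be used on its own; combining it with IgnoreSymbols throws an ArgumentException.
    private const CompareOptions COMPARE_OPTIONS = CompareOptions.IgnoreSymbols | CompareOptions.IgnoreCase;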

    public int Compare(string x, string y)
    {
        if (ReferenceEquals(x, y)) return 0; // also covers the case where both are null
        if (x == null) return -1;
        if (y == null) return 1;

        return this._compareInfo.Compare(x, y, COMPARE_OPTIONS);
    }
}

Can I obtain the output string which is, ultimately, used for the comparison?

My final goal is to produce an IEqualityComparer<string> which ignores symbols and casing in the same way as this comparer.

I can write a regex to do this, but there's no guarantee that my regex will use the same logic as the built-in comparison options do.

Ian Kemp
Matthew
  • If you're interested in just `Equals` you can do `yourComparer.Compare(x,y) == 0` – Sriram Sakthivel Apr 16 '14 at 19:26
  • @SriramSakthivel yes but that doesn't fulfill all the requirements of `IEqualityComparer` ... I still will need `GetHashCode` – Matthew Apr 16 '14 at 19:30
  • Is it possible that the strings should be parsed into a new, consistent structure (or converted to a consistent sort of string) in a certain way *before* comparing them? It'd make all of this much simpler, conceptually. – Tim S. Apr 16 '14 at 19:49
  • @TimS. I believe that your answer relies upon that very approach! The reality in that case would be that I wouldn't be able to use the built-in rules for the parsing but would use my own well-defined rules. – Matthew Apr 16 '14 at 19:50
  • In a sense, but if you can move the logic from comparing with your options to parsing in a way you know, and then represent it in a simpler fashion, then understanding and comparing your data could become much simpler. I don't know what your data represents, but e.g. if it were dollar/currency amounts, you might parse them as `decimal` first. – Tim S. Apr 16 '14 at 19:53

2 Answers

Quite an interesting question. Internally, CompareInfo.Compare calls the InternalCompareString method, which is imported from clr.dll as COMNlsInfo::InternalCompareString:

// Compare a string using the native API calls -- COMNlsInfo::InternalCompareString   
...
private static extern int InternalCompareString(IntPtr handle, 
             IntPtr handleOrigin, String localeName, String string1, int offset1, 
             int length1, String string2, int offset2, int length2, int flags);

In other words, as you can't be sure about the logic of the built-in function, maybe you should write your own and reuse it in both IEqualityComparer and IComparer implementations.
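As a minimal sketch of that idea: the names `LiberalKey` and `NormalizingStringComparer` below are hypothetical, and the rule "keep only letters and digits, upper-cased with the invariant culture" is an assumed definition of "ignoring symbols and case", not the rule the built-in options apply. The point is only that both interfaces go through the same normalization, so they cannot disagree with each other.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper: the single place where "equivalent" is defined.
public static class LiberalKey
{
    // Keeps only letters and digits and upper-cases them (invariant culture).
    // Assumption: this is the rule you want; it is not what CompareOptions does.
    public static string Normalize(string value)
    {
        if (value == null) return null;

        var sb = new StringBuilder(value.Length);
        foreach (char c in value)
        {
            if (char.IsLetterOrDigit(c))
                sb.Append(char.ToUpperInvariant(c));
        }
        return sb.ToString();
    }
}

public sealed class NormalizingStringComparer : IComparer<string>, IEqualityComparer<string>
{
    public int Compare(string x, string y)
    {
        // string.CompareOrdinal treats null as smaller than any non-null string.
        return string.CompareOrdinal(LiberalKey.Normalize(x), LiberalKey.Normalize(y));
    }

    public bool Equals(string x, string y)
    {
        return Compare(x, y) == 0;
    }

    public int GetHashCode(string obj)
    {
        // Hash the normalized key so that equal inputs always hash alike.
        string key = LiberalKey.Normalize(obj);
        return key == null ? 0 : StringComparer.Ordinal.GetHashCode(key);
    }
}
```

With this, `new HashSet<string>(new NormalizingStringComparer())` would treat "Foo-Bar" and "foo bar" as the same entry, because both normalize to "FOOBAR".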

Ian Kemp
Konrad Kokosa
  • +1 for demonstrating that I can't reliably replicate the built-in function, and should use a custom and well-defined `IComparer` and `IEqualityComparer` – Matthew Apr 16 '14 at 19:50

There is probably no such "output string". I'd implement your Equals in this way:

return liberalStringComparer.Compare(x, y) == 0;

GetHashCode is more complicated.

Some approaches:

  1. Use a poor implementation like return 0; (which means you always have to run a Compare to know if they're equal).
  2. Since your comparison is relatively simple (invariant culture, ordinal ignore case comparison), you should be able to make a hash that generally works. Without extensive study of Unicode and testing, however, I wouldn't recommend that you assume this'll work for any valid Unicode string from any culture.

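    As a rough sketch:

    public int GetHashCode(string value)
    {
        unchecked // let the arithmetic overflow wrap around
        {
            int hash = 17;
            for (int i = 0; i < value.Length; i++)
                if (!char.IsSymbol(value, i)) // skip symbols
                    // fold case, then combine as in http://stackoverflow.com/a/263416/781792
                    hash = hash * 23 + char.ToUpperInvariant(value[i]);
            return hash;
        }
    }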
    
  3. Form a string by removing all characters for which char.IsSymbol is true, then use StringComparer.InvariantCultureIgnoreCase.GetHashCode on it, so that casing is ignored as well.
  4. CompareInfo.GetSortKey's hash code should be a suitable value.

    public int GetHashCode(string value)
    {
        return _compareInfo.GetSortKey(value, COMPARE_OPTIONS).GetHashCode();
    }
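
Putting the Equals suggestion and approach 4 together, a complete comparer might look like the sketch below. The class name `LiberalStringEqualityComparer` is hypothetical, and it assumes the options are `IgnoreSymbols | IgnoreCase`, since `OrdinalIgnoreCase` cannot be combined with other flags. Both members use the same `CompareInfo` and the same options, which is what should keep the hash consistent with the equality check.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public sealed class LiberalStringEqualityComparer : IEqualityComparer<string>
{
    private static readonly CompareInfo Info = CultureInfo.InvariantCulture.CompareInfo;

    // Assumption: IgnoreCase stands in for OrdinalIgnoreCase, which must be used alone.
    private const CompareOptions Options = CompareOptions.IgnoreSymbols | CompareOptions.IgnoreCase;

    public bool Equals(string x, string y)
    {
        if (ReferenceEquals(x, y)) return true;   // same instance, or both null
        if (x == null || y == null) return false; // exactly one is null
        return Info.Compare(x, y, Options) == 0;
    }

    public int GetHashCode(string obj)
    {
        if (obj == null) return 0;
        // The sort key is built with the same options as the comparison.
        return Info.GetSortKey(obj, Options).GetHashCode();
    }
}
```

Building a sort key for every hash is not free, so this is probably best suited to moderately sized sets and dictionaries. Newer framework versions also expose a `CompareInfo.GetHashCode(string, CompareOptions)` overload that may avoid the intermediate `SortKey`; check that it is available on your target before relying on it.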
    
Tim S.
  • I suppose I could remove all chars which return `true` for `char.IsSymbol || char.IsWhiteSpace` and then perform a `CultureInvariantIgnoreCase.GetHashCode` on those resulting strings... Alternatively I could use the `GetUnicodeCategory` method and explicitly exclude categories. – Matthew Apr 16 '14 at 19:40
  • It appears that `char.IsSymbol` doesn't return true for whitespace. I think, instead, I want to explicitly include things which are `char.IsLetterOrDigit` – Matthew Apr 16 '14 at 19:46
  • I selected this answer because I had to create my own reliable string processor to remove spaces and symbols, then I used the CultureInvariantCaseInsensitive solution which is built-in – Matthew Apr 17 '14 at 00:57