How can I perform a culture-sensitive "starts-with" operation from the middle of a string?

I have a requirement which is relatively obscure, but it feels like it should be possible using the BCL.

For context, I'm parsing a date/time string in Noda Time. I maintain a logical cursor for my position within the input string. So while the complete string may be "3 January 2013" the logical cursor may be at the 'J'.

Now, I need to parse the month name, comparing it against all the known month names for the culture:

  • Culture-sensitively
  • Case-insensitively
  • Just from the point of the cursor (not later; I want to see if the cursor is "looking at" the candidate month name)
  • Quickly
  • ... and I need to know afterwards how many characters were used

The current code to do this generally works, using CompareInfo.Compare. It's effectively like this (just for the matching part - there's more code in the real thing, but it's not relevant to the match):

internal bool MatchCaseInsensitive(string candidate, CompareInfo compareInfo)
{
    return compareInfo.Compare(text, position, candidate.Length,
                               candidate, 0, candidate.Length, 
                               CompareOptions.IgnoreCase) == 0;
}

However, that relies on the candidate and the region we compare being the same length. Fine most of the time, but not fine in some special cases. Suppose we have something like:

// U+00E9 is a single code point for e-acute
var text = "x b\u00e9d y";
int position = 2;
// e followed by U+0301 still means e-acute, but from two code points
var candidate = "be\u0301d";

Now my comparison will fail. I could use IsPrefix:

if (compareInfo.IsPrefix(text.Substring(position), candidate,
                         CompareOptions.IgnoreCase))

but:

  • That requires me to create a substring, which I'd really rather avoid. (I'm viewing Noda Time as effectively a system library; parsing performance may well be important to some clients.)
  • It doesn't tell me how far to advance the cursor afterwards

In reality, I strongly suspect this won't come up very often... but I'd really like to do the right thing here. I'd also really like to be able to do it without becoming a Unicode expert or implementing it myself :)

(Raised as bug 210 in Noda Time, in case anyone wants to follow any eventual conclusion.)

I like the idea of normalization. I need to check that in detail for a) correctness and b) performance. Assuming I can make it work correctly, I'm still not sure whether it would be worth changing overall - it's the sort of thing which will probably never actually come up in real life, but could hurt the performance of all my users :(
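
To make the idea concrete, here's a minimal sketch of normalizing the candidate month names once up front (the method name is mine, and it assumes `using System.Text;` and a build where string.Normalize is available):

static string[] NormalizeCandidates(string[] monthNames)
{
    var normalized = new string[monthNames.Length];
    for (int i = 0; i < monthNames.Length; i++)
    {
        var name = monthNames[i];
        // Normalize allocates a new string, so skip it when the name
        // is already in form C - the common case for built-in cultures.
        normalized[i] = name.IsNormalized(NormalizationForm.FormC)
            ? name
            : name.Normalize(NormalizationForm.FormC);
    }
    return normalized;
}

That only helps with the candidates, of course - the input text would still need the same treatment (or at least a cheap check) at parse time.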

I've also checked the BCL - which doesn't appear to handle this properly either. Sample code:

using System;
using System.Globalization;

class Test
{
    static void Main()
    {
        var culture = (CultureInfo) CultureInfo.InvariantCulture.Clone();
        var months = culture.DateTimeFormat.AbbreviatedMonthNames;
        months[10] = "be\u0301d";
        culture.DateTimeFormat.AbbreviatedMonthNames = months;

        var text = "25 b\u00e9d 2013";
        var pattern = "dd MMM yyyy";
        DateTime result;
        if (DateTime.TryParseExact(text, pattern, culture,
                                   DateTimeStyles.None, out result))
        {
            Console.WriteLine("Parsed! Result={0}", result);
        }
        else
        {
            Console.WriteLine("Didn't parse");
        }
    }
}

Changing the custom month name to just "bed" with a text value of "bEd" parses fine.

Okay, a few more data points:

  • The cost of using Substring and IsPrefix is significant but not horrible. On a sample of "Friday April 12 2013 20:28:42" on my development laptop, it changes the number of parse operations I can execute in a second from about 460K to about 400K. I'd rather avoid that slowdown if possible, but it's not too bad.

  • Normalization is less feasible than I thought - because it's not available in Portable Class Libraries. I could potentially use it just for non-PCL builds, allowing the PCL builds to be a little less correct (see the sketch after this list). The performance hit of testing for normalization (string.IsNormalized) takes performance down to about 445K calls per second, which I can live with. I'm still not sure it does everything I need it to - for example, a month name containing "ß" should match "ss" in many cultures, I believe... and normalizing doesn't do that.
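
A minimal sketch of what that conditional use might look like (the PCL symbol is hypothetical - whatever symbol the portable build configuration actually defines would go there - and it assumes `using System.Text;`):

static string EnsureFormC(string text)
{
#if !PCL // hypothetical symbol for the Portable Class Library build
    // Only pay for Normalize when IsNormalized says we have to;
    // PCL builds skip the check entirely and stay slightly less correct.
    if (!text.IsNormalized(NormalizationForm.FormC))
    {
        return text.Normalize(NormalizationForm.FormC);
    }
#endif
    return text;
}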

  • While I understand your desire to avoid the performance hit of creating a substring, it might be best to do so, but earlier in the game by shifting everything to a chosen unicode normalization form FIRST and then knowing you can walk "point-by-point". Probably D-form. – IDisposable Apr 12 '13 at 20:42
  • @IDisposable: Yes, I did wonder about that. Obviously I can normalize the month names themselves beforehand. At least I can do the normalization just once. I wonder if the normalization procedure checks whether anything needs to be done first. I don't have much experience in normalization - definitely one avenue to look into. – Jon Skeet Apr 12 '13 at 20:56
  • If your `text` isn't too long, you could do `if (compareInfo.IndexOf(text, candidate, position, options) == position)`. http://msdn.microsoft.com/en-us/library/ms143031.aspx But if `text` is very long that's going to waste a lot of time searching beyond where it needs to. – Jim Mischel Apr 12 '13 at 21:31
  • @JimMischel: It's unlikely that the text will be very long, but it *could* be... and it also wouldn't tell me how far to move the cursor. – Jon Skeet Apr 12 '13 at 22:40
  • @JonSkeet: Normalization for sure. But I wonder -- since strings are immutable, doesn't `string.Substring` share the same backing data as the parent? And if it does, how bad can just the one allocation for the `string` itself be? OTOH, if you are parsing millions of dates... – Jon Apr 12 '13 at 22:46
  • @Jon: No, it doesn't - it used to in Java, but it doesn't any more, and it's never done so in .NET. To be honest, I haven't benchmarked this particular tweak - but I know that our parsing routines are pretty sensitive. I'll try it over the weekend if I can find time. I should probably do this not just for day/month matching, but also any text literals included in the pattern, which would potentially mean rather more substrings. Given that the BCL doesn't do this right either, I'm *really* tempted to just chalk this up to getting the 99.999% case right... – Jon Skeet Apr 12 '13 at 22:49
  • Just bypass using the `String` class **at all** in this instance and use a `Char[]` directly. You'll end up writing more code, but that's what happens when you want high performance... or maybe you should be programming in C++/CLI ;-) – intrepidis Apr 12 '13 at 23:05
  • @ChrisNash: Well I start off with two strings, so I can't avoid them being created. But the main problem with that idea is that then I've got to write all the comparison code myself, which I'm unlikely to do correctly. – Jon Skeet Apr 13 '13 at 06:45
  • Will [CompareOptions.IgnoreNonSpace](http://msdn.microsoft.com/en-gb/library/system.globalization.compareoptions.aspx) not take care of this automagically for you? It looks to me (from the docco, not in a position to test from this iPad sorry!) as though this might be a (_the_?) use-case for that option. "_Indicates that the string comparison must ignore nonspacing combining characters, such as diacritics._" – Sepster Apr 13 '13 at 10:03
  • Is it a problem that `string.Substring` doesn't take into account characters that are from two code points? If it is, then you may have to resort to `System.Globalization.StringInfo` or something. – intrepidis Apr 13 '13 at 18:58

3 Answers


I'll consider the problem of many-to-one and many-to-many case mappings first, separately from handling different normalization forms.

For example:

x heiße y
  ^--- cursor

Matches heisse, but then moves the cursor one character too far. And:

x heisse y
  ^--- cursor

Matches heiße, but then moves the cursor one character short.

This will apply to any character that doesn't have a simple one-to-one mapping.

You would need to know the length of the substring that was actually matched, but Compare, IndexOf, etc. throw that information away. It might be possible with regular expressions, but the implementation doesn't do full case folding, and so doesn't match ß to ss/SS in case-insensitive mode even though .Compare and .IndexOf do. And it would probably be costly to create new regexes for every candidate anyway.

The simplest solution to this is to just store strings internally in case folded form and do binary comparisons with case folded candidates. Then you can move the cursor correctly with just .Length, since the cursor is for the internal representation. You also get most of the lost performance back from not having to use CompareOptions.IgnoreCase.

Unfortunately there is no built-in case fold function, and the poor man's case folding doesn't work either, because there is no full case mapping - the ToUpper method doesn't turn ß into SS.
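
That's easy to demonstrate in C#:

using System;
using System.Globalization;

class Test
{
    static void Main()
    {
        // Both lines print "ß": .NET's ToUpper performs only simple
        // (one-to-one) case mapping, so U+00DF is not expanded to "SS".
        Console.WriteLine("ß".ToUpperInvariant());
        Console.WriteLine("ß".ToUpper(new CultureInfo("de-DE")));
    }
}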

For example, this works in Java (and even in JavaScript), given a string that is in normal form C:

//Poor man's case folding.
//There are some edge cases where this doesn't work
public static String toCaseFold( String input, Locale cultureInfo ) {
    return input.toUpperCase(cultureInfo).toLowerCase(cultureInfo);
}

Fun to note that Java's ignore-case comparison doesn't do full case folding like C#'s CompareOptions.IgnoreCase does. So they are opposites in this regard: Java does full case mapping but simple case folding, while C# does simple case mapping but full case folding.

So it's likely that you need a 3rd party library to case fold your strings before using them.
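
Assuming such a library provides a case folding function (CaseFold below is hypothetical), the matching step itself becomes a plain ordinal comparison - a sketch:

// Sketch: foldedText and foldedCandidate are assumed to have been
// normalized to form C and case folded (by the hypothetical library's
// CaseFold) up front, so an ordinal comparison is all that's needed.
internal static bool MatchCaseFolded(string foldedText, ref int position, string foldedCandidate)
{
    if (foldedCandidate.Length <= foldedText.Length - position &&
        string.CompareOrdinal(foldedText, position,
                              foldedCandidate, 0, foldedCandidate.Length) == 0)
    {
        // The cursor indexes the folded text, so Length is exact.
        position += foldedCandidate.Length;
        return true;
    }
    return false;
}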


Before doing anything you have to be sure that your strings are in normal form C. You can use this preliminary quick check optimized for Latin script:

public static bool MaybeRequiresNormalizationToFormC(string input)
{
    if( input == null ) throw new ArgumentNullException("input");

    int len = input.Length;
    for (int i = 0; i < len; ++i)
    {
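        // Everything below U+0300 (the start of the combining
        // diacritical marks block) is unaffected by normalization
        // to form C, so only higher code units need a closer look.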
        if (input[i] > 0x2FF)
        {
            return true;
        }
    }

    return false;
}

This gives false positives but not false negatives. I don't expect it to slow down the 460k parses/s at all when using Latin script characters, even though it needs to be performed on every string. With a false positive you would use IsNormalized to get a true negative/positive, and only after that normalize if necessary.
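
Putting the cheap scan and the definitive check together, the flow would be something like this (method name mine; NormalizationForm lives in System.Text):

public static string EnsureNormalizedFormC(string input)
{
    // Cheap scan first; only fall back to the definitive IsNormalized
    // check - and an actual Normalize - when the scan reports a
    // possible positive.
    if (MaybeRequiresNormalizationToFormC(input) &&
        !input.IsNormalized(NormalizationForm.FormC))
    {
        return input.Normalize(NormalizationForm.FormC);
    }
    return input;
}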


So in conclusion: ensure normal form C first, then case fold. Do binary comparisons with the processed strings, and move the cursor as you are moving it currently.
Esailija
  • Thanks for this - I'll need to look into normalization form C in more detail, but these are great pointers. I think I can live with the "it doesn't work quite correctly under the PCL" (which doesn't provide normalization). Using a 3rd party library for case-folding would be overkill here - we currently have no 3rd party dependencies, and introducing one just for a corner case that even the BCL doesn't handle would be a pain. Presumably case-folding is culture-sensitive, btw (e.g. Turkish)? – Jon Skeet Apr 15 '13 at 08:58
  • @JonSkeet yes, Turkic deserves its own mode in the casefold mappings :P See the format section in the header of [CaseFolding.txt](http://www.unicode.org/Public/UNIDATA/CaseFolding.txt) – Esailija Apr 15 '13 at 10:17
  • This answer seems to have a fundamental flaw, in that it implies that characters map to ligatures (and vice versa) only when case-folding. This is not the case; there are ligatures that are considered equal to characters irrespective of casing. For example, under the en-US culture, `æ` is equal to `ae`, and `ﬃ` (U+FB03) is equal to `ffi`. C-normalization does not handle ligatures at all, since it only allows compatibility mappings (which are typically restricted to combining characters). – Douglas Feb 19 '16 at 08:22
  • KC- and KD-normalization do handle some ligatures, such as `ﬃ` (U+FB03), but miss others, such as `æ`. The issue is made worse by the discrepancies between cultures -- `æ` is equal to `ae` under en-US, but *not* under da-DK, as discussed under the MSDN documentation for [strings](https://msdn.microsoft.com/en-us/library/system.string(v=vs.110).aspx#comparison). Thus, normalization (to any form) and case-mapping are not a sufficient solution for this problem. – Douglas Feb 19 '16 at 08:28
  • Small correction to my earlier comment: C-normalization only allows *canonical* mappings (such as for combining characters), not compatibility mappings (such as for ligatures). – Douglas Feb 19 '16 at 09:01
  • @Douglas And what problem is that? The problem is not, to my best knowledge, to make The Ultimate Character Comparer but to match month names case insensitively as fast as possible. – Esailija Feb 19 '16 at 09:18
  • The nature of the question is generic: “How can I perform a culture-sensitive ‘starts-with’ operation from the middle of a string?” Even if you restrict yourself to month names, then C normalization is either unnecessary or insufficient. If you're only dealing with English months, then you don't need normalization at all, whereas if you're dealing with other languages, then you'll need to support ligatures (which C-normalization doesn't). – Douglas Feb 19 '16 at 09:30
  • What is "you need to support x" even based on? There are arbitrary levels of sophistication of equality and you can always support better and more. – Esailija Feb 19 '16 at 10:25
  • There is just one standard “level of sophistication”, and that's the semantics of `string.Compare` and `string.Equals` (under the six `StringComparison` options, including `CurrentCulture`). Anything less is a bug when implemented in a general-purpose library – you can't expect your consumers to know that you support combining characters but not ligatures. – Douglas Feb 19 '16 at 10:57

See if this meets the requirement:

public static partial class GlobalizationExtensions {
    public static int IsPrefix(
        this CompareInfo compareInfo,
        String source, String prefix, int startIndex, CompareOptions options
        ) {
        if(compareInfo.IndexOf(source, prefix, startIndex, options)!=startIndex)
            return ~0;
        else
            // source starts with prefix at startIndex here,
            // so the loop below must terminate
            for(int length2=0, length1=prefix.Length; ; )
                if(0==compareInfo.Compare(
                        prefix, 0, length1, 
                        source, startIndex, ++length2, options))
                    return length2;
    }
}

compareInfo.Compare only runs once source is known to start with prefix; if it doesn't, IsPrefix returns -1; otherwise, it returns the number of characters used in source.

However, I see no alternative to incrementing length2 by 1 at a time for a case like the following:

var candidate="ßssß\u00E9\u0302";
var text="abcd ssßss\u0065\u0301\u0302sss";

var count=
    culture.CompareInfo.IsPrefix(text, candidate, 5, CompareOptions.IgnoreCase);

Update:

I tried to improve performance a little, though I haven't verified that the following code is bug-free:

public static partial class GlobalizationExtensions {
    public static int Compare(
        this CompareInfo compareInfo,
        String source, String prefix, int startIndex, ref int length2, 
        CompareOptions options) {
        int length1=prefix.Length, v2, v1;

        if(0==(v1=compareInfo.Compare(
            prefix, 0, length1, source, startIndex, length2, options))
            ) {
            return 0;
        }
        else {
            if(0==(v2=compareInfo.Compare(
                prefix, 0, length1, source, startIndex, 1+length2, options))
                ) {
                ++length2;
                return 0;
            }
            else {
                if(v1<0||v2<0) {
                    length2-=2;
                    return -1;
                }
                else {
                    length2+=2;
                    return 1;
                }
            }
        }
    }

    public static int IsPrefix(
        this CompareInfo compareInfo,
        String source, String prefix, int startIndex, CompareOptions options
        ) {
        if(compareInfo.IndexOf(source, prefix, startIndex, options)
                !=startIndex)
            return ~0;
        else
            for(int length2=
                    Math.Min(prefix.Length, source.Length-(1+startIndex)); ; )
                if(0==compareInfo.Compare(
                        source, prefix, startIndex, ref length2, options))
                    return length2;
    }
}

I tested with the particular case above, and the number of comparisons went down to about 3.

Ken Kin
  • I'd *really* rather not have to loop like this. Admittedly with the early out it will only need to loop if it's found something, but I'd still rather not have to make 8 string comparisons just to match "February" for example. It feels like there must be a better way. Also, the initial `IndexOf` operation has to look through the entire string from the starting position, which would be a performance pain if the input string is long. – Jon Skeet Apr 18 '13 at 05:44
  • @JonSkeet: Thank you. Maybe something can be added to detect whether the loop can be shortened. I'll think about that. – Ken Kin Apr 18 '13 at 05:51
  • @JonSkeet: Would you consider using reflection? When I traced into the methods, they fall into invoking native methods before long. – Ken Kin Apr 18 '13 at 07:02
  • No, that would be way too brittle - and it would also fail in low-trust environments. – Jon Skeet Apr 18 '13 at 07:07
  • @JonSkeet: Then I can hardly figure out a way to improve the particular case like the example in the answer. Maybe call it after checking the current culture first, or only when other parse approaches have failed. The thinking is to ***invoke a rarely used method only when a rare case occurs***. – Ken Kin Apr 18 '13 at 07:19
  • Sure. I'd like to think there's a way of making the "it's not present at all" case faster though. The annoying thing is that as you say, I'm sure there are native methods which do exactly what I need... – Jon Skeet Apr 18 '13 at 07:31
  • @JonSkeet: Sounds like even porting the methods from the BCL would be no better; that would mean you'd have to maintain the code which manipulates Unicode... – Ken Kin Apr 18 '13 at 07:46
  • Indeed. Noda Time doesn't want to get into the business of Unicode details :) – Jon Skeet Apr 18 '13 at 08:19
  • A way you could improve it is to know the language the months and days are in and maintain a list whether that language requires full case folding. Then you don't have to do the O(n²) loop for example when you know the strings passed in will be English. But it is not unlike maintaining case fold mappings in the first place for sweet binary/ordinal comparisons. But you have to wonder if users even expect this seeing as even ToUpper/ToLower doesn't work. – Esailija Apr 18 '13 at 10:46
  • I have solved a similar problem once like this (search-string highlighting in HTML). I did it similarly. You can tune the loop and search strategy in a way that makes it complete very quickly by checking the likely cases first. The nice thing about this is that it seems to be totally correct and no Unicode details leak into your code. – usr Apr 18 '13 at 11:02
  • @Esailija: Thank you. I've seen the implementation details of the methods; they fall almost directly into invoking native methods after some argument checks. An idea I had along the lines of your suggestion is something like `String.Intern`; however, I haven't yet worked out how. – Ken Kin Apr 18 '13 at 11:21
  • @usr can you give specifics? How would you be able to tell that there are complex (not one-to-one) mappings without knowing about Unicode? – Esailija Apr 18 '13 at 11:25
  • Does this give exception to you (when using the new code): `int len = CultureInfo.GetCultureInfo("de").CompareInfo.IsPrefix("x ßheßieß y", "sshessiess", 2, CompareOptions.IgnoreCase);` – Esailija Apr 25 '13 at 13:39
  • @Esailija: Thank you for this. I've revised; hopefully it's fixed for all the cases. – Ken Kin Apr 25 '13 at 14:25

This is actually possible without normalization and without using IsPrefix.

We need to compare the same number of text elements as opposed to the same number of characters, but still return the number of matching characters.

I've created a copy of the MatchCaseInsensitive method from ValueCursor.cs in Noda Time and modified it slightly so that it can be used in a static context:

// Noda time code from MatchCaseInsensitive in ValueCursor.cs
static int IsMatch_Original(string source, int index, string match, CompareInfo compareInfo)
{
    unchecked
    {
        if (match.Length > source.Length - index)
        {
            return 0;
        }

        // TODO(V1.2): This will fail if the length in the input string is different to the length in the
        // match string for culture-specific reasons. It's not clear how to handle that...
        if (compareInfo.Compare(source, index, match.Length, match, 0, match.Length, CompareOptions.IgnoreCase) == 0)
        {
            return match.Length;
        }

        return 0;
    }
}

(Included just for reference; as you know, it's the code that won't compare properly.)

The following variant of that method uses StringInfo.GetNextTextElement which is provided by the framework. The idea is to compare text element by text element to find a match and if found return the actual number of matching characters in the source string:

// Using StringInfo.GetNextTextElement to match by text elements instead of characters
static int IsMatch_New(string source, int index, string match, CompareInfo compareInfo)
{
    int sourceIndex = index;
    int matchIndex = 0;

    // Loop until we reach the end of source or match
    while (sourceIndex < source.Length && matchIndex < match.Length)
    {
        // Get text elements at the current positions of source and match
        // Normally that will be just one character but may be more in case of Unicode combining characters
        string sourceElem = StringInfo.GetNextTextElement(source, sourceIndex);
        string matchElem = StringInfo.GetNextTextElement(match, matchIndex);

        // Compare the current elements.
        if (compareInfo.Compare(sourceElem, matchElem, CompareOptions.IgnoreCase) != 0)
        {
            return 0; // No match
        }

        // Advance in source and match (by number of characters)
        sourceIndex += sourceElem.Length;
        matchIndex += matchElem.Length;
    }

    // Check if we reached end of source and not end of match
    if (matchIndex != match.Length)
    {
        return 0; // No match
    }

    // Found match. Return number of matching characters from source.
    return sourceIndex - index;
}

That method works just fine at least according to my test cases (which basically just test a couple of variants of the strings you've provided: "b\u00e9d" and "be\u0301d").

However, the GetNextTextElement method creates a substring for each text element, so this implementation requires a lot of substring comparisons - which will have an impact on performance.

So, I created another variant that does not use GetNextTextElement but instead skips over Unicode combining characters to find the actual match length in characters:

// This should be faster
static int IsMatch_Faster(string source, int index, string match, CompareInfo compareInfo)
{
    int sourceLength = source.Length;
    int matchLength = match.Length;
    int sourceIndex = index;
    int matchIndex = 0;

    // Loop until we reach the end of source or match
    while (sourceIndex < sourceLength && matchIndex < matchLength)
    {
        sourceIndex += GetTextElemLen(source, sourceIndex, sourceLength);
        matchIndex += GetTextElemLen(match, matchIndex, matchLength);
    }

    // Check if we reached end of source and not end of match
    if (matchIndex != matchLength)
    {
        return 0; // No match
    }

    // Check if we've found a match
    if (compareInfo.Compare(source, index, sourceIndex - index, match, 0, matchIndex, CompareOptions.IgnoreCase) != 0)
    {
        return 0; // No match
    }

    // Found match. Return number of matching characters from source.
    return sourceIndex - index;
}

That method uses the following two helpers:

static int GetTextElemLen(string str, int index, int strLen)
{
    bool stop = false;
    int elemLen;

    for (elemLen = 0; index < strLen && !stop; ++elemLen, ++index)
    {
        stop = !IsCombiningCharacter(str, index);
    }

    return elemLen;
}

static bool IsCombiningCharacter(string str, int index)
{
    switch (CharUnicodeInfo.GetUnicodeCategory(str, index))
    {
        case UnicodeCategory.NonSpacingMark:
        case UnicodeCategory.SpacingCombiningMark:
        case UnicodeCategory.EnclosingMark:
            return true;

        default:
            return false;
    }
}

I haven't done any benchmarking, so I don't really know whether the faster method is actually faster. Nor have I done any extended testing.

But this should answer your question on how to perform culture-sensitive substring matching for strings that may include Unicode combining characters.

These are the test cases I've used:

static Tuple<string, int, string, int>[] tests = new []
{
    Tuple.Create("x b\u00e9d y", 2, "be\u0301d", 3),
    Tuple.Create("x be\u0301d y", 2, "b\u00e9d", 4),

    Tuple.Create("x b\u00e9d", 2, "be\u0301d", 3),
    Tuple.Create("x be\u0301d", 2, "b\u00e9d", 4),

    Tuple.Create("b\u00e9d y", 0, "be\u0301d", 3),
    Tuple.Create("be\u0301d y", 0, "b\u00e9d", 4),

    Tuple.Create("b\u00e9d", 0, "be\u0301d", 3),
    Tuple.Create("be\u0301d", 0, "b\u00e9d", 4),

    Tuple.Create("b\u00e9", 0, "be\u0301d", 0),
    Tuple.Create("be\u0301", 0, "b\u00e9d", 0),
};

The tuple values are:

  1. The source string (haystack)
  2. The starting position in source.
  3. The match string (needle).
  4. The expected match length.

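A minimal driver for these might look as follows (a sketch, assuming the three methods and the tests array live in the same class; the invariant culture is enough here since the test strings differ only in composition):

static void Main()
{
    var compareInfo = CultureInfo.InvariantCulture.CompareInfo;
    for (int i = 0; i < tests.Length; i++)
    {
        var t = tests[i];
        // OK when the method returns the expected match length, else BAD.
        Console.WriteLine("Test #{0}: Original={1}; New={2}; Faster={3}", i,
            IsMatch_Original(t.Item1, t.Item2, t.Item3, compareInfo) == t.Item4 ? "OK" : "BAD",
            IsMatch_New(t.Item1, t.Item2, t.Item3, compareInfo) == t.Item4 ? "OK" : "BAD",
            IsMatch_Faster(t.Item1, t.Item2, t.Item3, compareInfo) == t.Item4 ? "OK" : "BAD");
    }
}
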
Running those tests on the three methods yields this result:

Test #0: Original=BAD; New=OK; Faster=OK
Test #1: Original=BAD; New=OK; Faster=OK
Test #2: Original=BAD; New=OK; Faster=OK
Test #3: Original=BAD; New=OK; Faster=OK
Test #4: Original=BAD; New=OK; Faster=OK
Test #5: Original=BAD; New=OK; Faster=OK
Test #6: Original=BAD; New=OK; Faster=OK
Test #7: Original=BAD; New=OK; Faster=OK
Test #8: Original=OK; New=OK; Faster=OK
Test #9: Original=OK; New=OK; Faster=OK

The last two tests cover the case when the source string is shorter than the match string. In this case the original (Noda Time) method succeeds as well.

Mårten Wikström
  • Thank you so much for this. I'll need to look over it in detail to see how well it performs, but it looks like a great starting point. More knowledge of Unicode (in the code itself) than I'd *hoped* would be required, but if the platform doesn't do what's required, there's not a lot I can do about that :( – Jon Skeet Mar 19 '14 at 17:12
  • @JonSkeet: Glad to be of any help! And yes, substring matching with Unicode support should definitely have been included in the framework... – Mårten Wikström Mar 19 '14 at 17:22