21

I've noticed this strange issue. Check out this Vietnamese (according to Google Translate) string:

string line = "Mìng-dĕ̤ng-ngṳ̄";
string sub = "Mìng-dĕ̤ng-ngṳ";
line.Length
15
sub.Length
14
line.StartsWith(sub)
false

Which seems to me like a wrong result. So, I've implemented my custom StartWith function, which compares the string char-by-char.

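public bool CustomStartWith(string parent, string child)
{
    // A longer child can never be a prefix; this also avoids indexing past the end of parent.
    if (child.Length > parent.Length)
        return false;

    // Compare the UTF-16 code units one by one.
    for (int i = 0; i < child.Length; i++)
    {
        if (parent[i] != child[i])
            return false;
    }
    return true;
}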

And as I assumed, running this function returns true:

CustomStartWith("Mìng-dĕ̤ng-ngṳ̄", "Mìng-dĕ̤ng-ngṳ")
true

What's going on here?! How's this possible?

Douglas
No1Lives4Ever
  • 3
    Use an invariant culture. See, for instance, http://stackoverflow.com/q/492799/1364007 – Wai Ha Lee Jan 12 '16 at 10:32
  • 6
    I don't know Vietnamese. There is a line over the last 'u'. Does that not make it a different letter? *Edit:* I missed that you're printing the lengths, which seems to show that the line is considered *another* character... interesting. – Jonathon Reinhart Jan 12 '16 at 10:39
  • 4
    Seems to me that `StartsWith()` SHOULD be returning false, because (as Jonathon points out), `line` does not actually start with `sub`. – Matthew Watson Jan 12 '16 at 10:43
  • @MatthewWatson: Maybe, maybe not. You really need a Vietnamese speaker to decide that (and I assume that even within that community you can argue for both cases. This looks to me like the ä vs a vs ae in German -- for which the answer can only be: it depends.... ) – Thilo Jan 12 '16 at 10:46
  • If you copy&paste the u with line on top to notepad++ it actually displays 2 characters (with utf-8), if you paste it to notepad it displays with the upper line but you need 2 backspaces to delete it.. – Janne Matikainen Jan 12 '16 at 10:47
  • 2
    @Thilo No, it really is different. The last `u` is a different letter. And also, to see the true length you need to use `new StringInfo(line).LengthInTextElements` which returns 13, not the 15 that string.Length returns. – Matthew Watson Jan 12 '16 at 10:48
  • I understand that. ä and ae are also two different letters. But maybe you want to treat them the same for ordering strings, maybe you don't. – Thilo Jan 12 '16 at 10:49
  • @JanneMatikainen When I paste `ṳ̄` to notepad++, it displays only the single `ṳ̄` character... We must have different settings. – Matthew Watson Jan 12 '16 at 10:50

2 Answers

37

The result returned by StartsWith is correct. By default, most string comparison methods perform culture-sensitive comparisons using the current culture, rather than comparing the raw sequences of UTF-16 code units. Although your line starts with a code unit sequence identical to sub, the substring it represents is not equivalent under most (or all) cultures.

If you really want a comparison that treats strings as plain sequences of code units (an ordinal comparison), use the overload:

line.StartsWith(sub, StringComparison.Ordinal);                       // true

If you want the comparison to be case-insensitive:

line.StartsWith(sub, StringComparison.OrdinalIgnoreCase);             // true

Here's a more familiar example:

var line1 = "caf\u00E9";    // 63 61 66 E9     – precomposed character 'é' (U+00E9)
var line2 = "cafe\u0301";   // 63 61 66 65 301 – base letter e (U+0065) and
                            //                   combining acute accent (U+0301)
var sub   = "cafe";         // 63 61 66 65
Console.WriteLine(line1.StartsWith(sub));                             // false
Console.WriteLine(line2.StartsWith(sub));                             // false
Console.WriteLine(line1.StartsWith(sub, StringComparison.Ordinal));   // false
Console.WriteLine(line2.StartsWith(sub, StringComparison.Ordinal));   // true

In the above examples, line2 starts with the same code unit sequence as sub, followed by a combining acute accent (U+0301) to be applied to the final e. line1 uses the precomposed character for é (U+00E9), so its code unit sequence does not match that of sub.

In real-world semantics, one would typically not consider cafe to be a substring of café; e and é are treated as distinct characters. That é can be represented as a pair of characters starting with e is an internal implementation detail of the encoding scheme (Unicode) and should not affect results. This is demonstrated by the above example contrasting the precomposed café (line1) with the decomposed café (line2); one would not expect different results for the two unless specifically intending an ordinal (code unit by code unit) comparison.
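As a rough sketch of that last point (not part of the original example): the two spellings of café are canonically equivalent, so normalizing both to the same form should make even an ordinal comparison agree. The class and variable names below are made up for illustration, and I've left the printed results off because normalization relies on the runtime's globalization support.

```csharp
using System;
using System.Text;

static class NormalizationSketch
{
    static void Main()
    {
        string precomposed = "caf\u00E9";   // é as the single code unit U+00E9
        string decomposed  = "cafe\u0301";  // e (U+0065) + combining acute accent (U+0301)

        // Ordinal equality compares UTF-16 code units, so the two spellings differ as written.
        Console.WriteLine(string.Equals(precomposed, decomposed, StringComparison.Ordinal));

        // Normalize both to form D (canonical decomposition) before comparing ordinally.
        string a = precomposed.Normalize(NormalizationForm.FormD);
        string b = decomposed.Normalize(NormalizationForm.FormD);
        Console.WriteLine(string.Equals(a, b, StringComparison.Ordinal));
    }
}
```

Normalizing first is only one option; a culture-sensitive comparison already treats the two spellings as equal without changing the strings.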

Adapting this explanation to your example:

string line = "Mìng-dĕ̤ng-ngṳ̄";   // 4D EC 6E 67 2D 64 115 324 6E 67 2D 6E 67 1E73 304
string sub  = "Mìng-dĕ̤ng-ngṳ";   // 4D EC 6E 67 2D 64 115 324 6E 67 2D 6E 67 1E73

Each .NET character represents a UTF-16 code unit, whose values are shown in the comments above. The first 14 code units are identical, which is why your char-by-char comparison evaluates to true (just like StringComparison.Ordinal). However, the 15th code unit in line is the combining macron, ◌̄ (U+0304), which combines with the preceding ṳ (U+1E73) to give ṳ̄.
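If you want to check the code units yourself, something like the following sketch should do it. The string is spelled with escapes so that copy and paste cannot silently change it; the class name is just a placeholder. The text element count of 13 is the figure Matthew Watson reports in the comments on the question.

```csharp
using System;
using System.Globalization;

static class CodeUnitDump
{
    static void Main()
    {
        // Same text as `line`, written with escapes so every code unit is explicit.
        string line = "M\u00ECng-d\u0115\u0324ng-ng\u1E73\u0304";

        // One value per UTF-16 code unit, which is what Length and the indexer see.
        foreach (char c in line)
            Console.Write("{0:X} ", (int)c);
        Console.WriteLine();

        Console.WriteLine(line.Length);                               // 15 code units
        Console.WriteLine(new StringInfo(line).LengthInTextElements); // 13 text elements
    }
}
```

The gap between 15 and 13 comes from the two combining marks (U+0324 and U+0304), each of which attaches to the letter before it.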

Douglas
  • 5
    But what is the default comparer for StartsWith? Also, why do you use IgnoreCase and not just Ordinal? The author uses an ordinal comparison without ignoring case. – Vadim Martynov Jan 12 '16 at 10:35
  • 3
    @VadimMartynov: I've updated the example to use `Ordinal`. Expanding explanation. – Douglas Jan 12 '16 at 10:48
  • 1
    It's also worth noting that `line1.StartsWith(line2)` is true (and vice versa), as is `line1.Equals(line2, StringComparison.InvariantCulture)` [and CurrentCulture for most reasonable settings... the single-argument Equals seems to use Ordinal] – Random832 Jan 12 '16 at 15:09
9

This is not a bug. String.StartsWith is in fact much smarter than a character-by-character check of your two strings. It takes into account your current culture (language settings, etc.) as well as contractions and special characters. (It does not care that two characters are needed to produce ṳ̄; it compares it as one.)

So if you don't want to take all those culture-specific settings into account, and just want an ordinal comparison, you have to tell the comparer that.

This is the correct way to do that (not ignoring the case, like Douglas did!):

line.StartsWith(sub, StringComparison.Ordinal);
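For what it's worth, here is a small sketch that puts the two kinds of comparison side by side, using the strings from the question written with escapes. The class name is invented for the example. Only the ordinal result is annotated, since the culture-sensitive results depend on the globalization data the runtime is using.

```csharp
using System;
using System.Globalization;

static class PrefixComparisonSketch
{
    static void Main()
    {
        string line = "M\u00ECng-d\u0115\u0324ng-ng\u1E73\u0304"; // ends with ṳ + combining macron
        string sub  = "M\u00ECng-d\u0115\u0324ng-ng\u1E73";       // same text without the macron

        // Ordinal: compares UTF-16 code units only, so the prefix matches.
        Console.WriteLine(line.StartsWith(sub, StringComparison.Ordinal)); // True

        // Culture-sensitive, but pinned to the invariant culture instead of
        // whatever the current thread culture happens to be.
        Console.WriteLine(line.StartsWith(sub, StringComparison.InvariantCulture));

        // The same culture-sensitive check expressed through CompareInfo.
        CompareInfo invariant = CultureInfo.InvariantCulture.CompareInfo;
        Console.WriteLine(invariant.IsPrefix(line, sub, CompareOptions.None));
    }
}
```

Passing the comparison explicitly every time also documents the intent for the next reader, whichever mode you pick.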
Community
Patrick Hofman
  • 1
    But why wouldn't you want to take these culture-specific settings into account? I don't speak Vietnamese, should these two Strings be considered prefix-equal or not? – Thilo Jan 12 '16 at 10:41
  • That is up to the OP. I don't know Vietnamese either, but I can understand that you sometimes want to compare one way and the next time the other way. That is why you can influence comparison using `Ordinal`. – Patrick Hofman Jan 12 '16 at 10:41
  • I understand. All I'm saying is that OP should carefully consider if that is really the way to go. – Thilo Jan 12 '16 at 10:43
  • I don't think that this is actually culture-specific - they won't be considered the same in _any_ locale. – Random832 Jan 12 '16 at 15:06
  • 3
    @Thilo do you maybe speak English or another European language? In those languages one normally considers letters and diacritics as units, with a few exceptions (and most of those exceptions are about combining letters, not splitting them). We'd normally say `naïve` and `naï̈ve` both have 5 letters and `façade` and `façade` both have 6. .NET accordingly considers the first two to `StartsWith` each other and likewise the second two, but gives `Length` of 5, 6, 6 & 7 respectively. – Jon Hanna Jan 12 '16 at 15:45