NUnit - how to compare strings containing composite Unicode characters?

Question

I'm using NUnit v2.5 to compare strings that contain composite Unicode characters.
Although the comparison itself works fine, the caret indicating the first difference seems to be misplaced.

UPD: I ended up with an overridden EqualConstraint that in turn invokes a custom TextMessageWriter, so I no longer need an answer. See the solution below.

Here's the snippet:

string s1 = "ใช้งานง่าย";
string s2 = "ใช้งานงาย";
Assert.That(s1, Is.EqualTo(s2));

Here's the output:

Expected: "ใช้งานงาย"
But was:  "ใช้งานง่าย"
------------------^

The arrow indicating the first different character seems to be off by 2 positions (as many as there are tone marks above). For longer strings, it becomes a real pain.
I have tried String.Normalize(), but it did not help either.

~~How can I overcome this problem?~~ Thanks for your help. See my answer below.

score 1 · Answer 1 · answered Feb 28 '12 at 21:24

When you are comparing Unicode strings, you must always normalize both sides of the comparison, and in the same way. It is not good enough to do a binary comparison of s1 and s2, because canonically equivalent strings are not necessarily binary equal.

Positing the existence of four trivial normalization functions, one for each of the four normalization forms, you would want to test NFD(s1) for binary equality to NFD(s2). It doesn't matter whether you use NFD or NFC there, but you must do the same thing to both strings.
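A minimal sketch of that test in C#, assuming .NET's built-in String.Normalize stands in for the NFD function posited above. The class and method names are made up for illustration.

```csharp
using System;
using System.Text;

public static class NormalizationDemo
{
    // Hypothetical helper: brings both sides to NFD, then compares them ordinally (binary).
    public static bool CanonicallyEqual(string a, string b)
    {
        return string.Equals(
            a.Normalize(NormalizationForm.FormD),
            b.Normalize(NormalizationForm.FormD),
            StringComparison.Ordinal);
    }
}
```

With this helper, "caf\u00E9" (precomposed é) and "cafe\u0301" (e followed by a combining acute accent) differ under a plain ordinal comparison but should compare equal after normalization.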

For the k-compat functions, NFKC and NFKD, those are useful when doing string searching, because they improve recall at the cost of some precision. For example, NFKD("™") would be equal to NFKD("TM"). This is what Adobe Reader does, for example, when you run searches on documents: it always runs the search in k-compat mode, so that your searches have a better chance of finding things. However, unlike NFC and NFD, the k-compat functions NFKC and NFKD lose information and are not reversible. With simple NFD and NFC, though, you can always get back to the other one.
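The trademark example can be tried the same way. This is only a sketch that assumes .NET's NormalizationForm.FormKD as the NFKD function; the class and method names are hypothetical.

```csharp
using System;
using System.Text;

public static class CompatibilityDemo
{
    // Hypothetical helper: compatibility-decomposes both sides before the binary comparison.
    public static bool CompatibilityEqual(string a, string b)
    {
        return string.Equals(
            a.Normalize(NormalizationForm.FormKD),
            b.Normalize(NormalizationForm.FormKD),
            StringComparison.Ordinal);
    }
}
```

Here "\u2122" (the trademark sign) should match "TM" under NFKD, while plain NFD leaves it untouched, which is the loss of information mentioned above.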

answered Feb 28 '12 at 21:24

tchrist

Thank you for the idea. The comparison itself works as expected; I assume it is always binary. The problem only concerns the indicator arrow. NUnit seems to calculate its length based on the char offset. Both "ใช้" and "ใช" take two placeholders while their string lengths are 3 and 2, respectively. Also, please note that the difference occurs much further on, but this word (the first one in the sample) also contributes to the wrong offset. – Be Brave Be Like Ukraine Feb 28 '12 at 23:15
@bytebuster One of the strings has U+0E48 `THAI CHARACTER MAI EK` in it, which the other one lacks. This character is both a `\p{Diacritic}` and a `\p{Nonspacing_Mark}` character. A string that has it in there is no more going to be the same as one without it than *café* or *niño* would be the same as *cafe* and *nino*. Unicode does indeed have a type of equality test that is tolerant of this difference, but it is not normalization. Rather, it is comparing them at primary strength only, using the Unicode Collation Algorithm. So you need to binary-compare UCA1(s1) with UCA1(s2). – tchrist Feb 28 '12 at 23:23
I understand what you mean in terms of Unicode, but I cannot see how this helps with NUnit. In short: the only thing I need is for the arrow to be two characters shorter, e.g. `--^` instead of `----^`. Can you please elaborate on your suggestion? – Be Brave Be Like Ukraine Feb 28 '12 at 23:37
@bytebuster My suggestion is that you compare them using the primary strength only of the UCA. That discounts diacritics and case distinction. Otherwise you'll have to resort to some mapping function that discards information like Mark characters, which may be wrong for this purpose. – tchrist Feb 28 '12 at 23:44
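
.NET has no function called UCA1, so the following is only an approximation of a primary-strength comparison. It assumes that CompareInfo.Compare with CompareOptions.IgnoreNonSpace and CompareOptions.IgnoreCase is close enough for this purpose; the class and method names are hypothetical, and the result depends on the platform's collation data.

```csharp
using System.Globalization;

public static class PrimaryStrengthDemo
{
    // Hypothetical helper: ignores non-spacing marks and case, which roughly
    // corresponds to comparing at primary strength. This is not a full UCA implementation.
    public static bool RoughlyPrimaryEqual(string a, string b)
    {
        CompareInfo compareInfo = CultureInfo.InvariantCulture.CompareInfo;
        CompareOptions options = CompareOptions.IgnoreNonSpace | CompareOptions.IgnoreCase;
        return compareInfo.Compare(a, b, options) == 0;
    }
}
```

Note that this changes what the assertion accepts: strings that differ only in diacritics or case would pass, which may or may not be wanted in a test.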

score 0 · Accepted Answer · answered Jun 29 '12 at 00:34

I could not find a better answer, so I am answering my own question.

Cause.
Many languages use non-spacing modifiers for characters. For European languages, there are precomposed substitutes, e.g. "u" (U+0075) + combining "¨" (U+0308) = "ü" (U+00FC). In this case, the solution by @tchrist is quite sufficient.

However, for complex writing systems, there are no precomposed substitutes for non-spacing modifiers, so normalization cannot remove them. NUnit's TextMessageWriter.WriteCaretLine(int mismatch) treats the mismatch parameter as a char offset, so the caret line ("-----^") comes out longer than the on-screen representation of a Thai string that contains such modifiers.
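
The gap between string length and on-screen width can be checked with a few lines. This sketch assumes that only characters of category NonSpacingMark take no placeholder, the same rule the code further down uses; it ignores surrogate pairs and wide characters. The class and method names are hypothetical.

```csharp
using System.Globalization;

public static class PlaceholderDemo
{
    // Hypothetical helper: counts the characters that occupy their own display cell,
    // i.e. everything except non-spacing marks.
    public static int CountPlaceholders(string s)
    {
        int count = 0;
        foreach (char c in s)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                count++;
        }
        return count;
    }
}
```

For "ใช้" (written as "\u0E43\u0E0A\u0E49" in escaped form), Length is 3 but CountPlaceholders returns 2, because U+0E49 is a non-spacing mark.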

Solution.
Force WriteCaretLine(int mismatch) to respect non-spacing modifiers by reducing the mismatch value by the number of non-spacing modifiers that occur before this offset.
Implement the supplementary classes, which are needed only to get your new code invoked.
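
The core of that adjustment, separated from the NUnit plumbing, can be sketched as follows. The class and method names are hypothetical and not part of NUnit; the rule (skip characters of category NonSpacingMark) is the one described above.

```csharp
using System;
using System.Globalization;

public static class CaretOffsetDemo
{
    // Hypothetical helper: converts a char offset into a count of character
    // placeholders by skipping non-spacing marks that occur before the offset.
    public static int ToPlaceholderOffset(string text, int charOffset)
    {
        int limit = Math.Min(charOffset, text.Length);
        int placeholders = 0;
        for (int i = 0; i < limit; i++)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(text[i]) != UnicodeCategory.NonSpacingMark)
                placeholders++;
        }
        return placeholders;
    }
}
```

For the actual string from the question, "ใช้งานง่าย", the first mismatch is at char offset 7. One non-spacing mark (U+0E49) precedes it, so the adjusted offset is 6.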

Along with Thai, I have tested it with Devanagari and Tibetan. It works as expected.

Yet another pitfall. If you're using NUnit with Visual Studio through ReSharper like I do, you have to configure your Internet Explorer's fonts (this cannot be managed from R#) so that it uses proper monospaced fonts for Thai, Devanagari, etc.

Implementation.

1. Inherit from TextMessageWriter and override its DisplayStringDifferences.
2. Implement your own ClipExpectedAndActual and FindMismatchPosition; this is where non-spacing modifiers are respected. Proper clipping is needed since it may also affect the count of non-spacing elements.
3. Inherit from EqualConstraint and override its WriteMessageTo(MessageWriter writer) so that your MessageWriter is used.
4. Optionally, create a custom wrapper for simple invocation of the custom constraint.

The source code goes below. About 80% of the code doesn't do anything useful, but it's included due to access levels in original code.

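using System;
using System.Globalization;
using NUnit.Framework;
using NUnit.Framework.Constraints;

// Step 1.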
public class ThaiMessageWriter : TextMessageWriter
{
    /// <summary>
    /// This method is merely a copy of the original method taken from NUnit sources,
    /// except that it recalculates <paramref name="mismatch"/> before the caret line is displayed.
    /// </summary>
    /// <remarks>
    /// The <paramref name="mismatch"/> passed in is a char offset, while proper display of the caret requires
    /// its position to be calculated in character placeholder units. The two differ when the string contains
    /// marks drawn above or below a base character, like an acute mark or complex script (Thai).
    /// </remarks>
    public override void DisplayStringDifferences(string expected, string actual, int mismatch, bool ignoreCase, bool clipping)
    {
        // Maximum string we can display without truncating
        int maxDisplayLength = MaxLineLength
                               - PrefixLength   // Allow for prefix
                               - 2;             // 2 quotation marks

        int mismatchOffset = mismatch;

        if (clipping)
            MsgUtils2.ClipExpectedAndActual(ref expected, ref actual, maxDisplayLength, mismatchOffset);

        expected = MsgUtils.EscapeControlChars(expected);
        actual = MsgUtils.EscapeControlChars(actual);

        // The mismatch position may have changed due to clipping or white space conversion
        int mismatchInCharPlaceholders = MsgUtils2.FindMismatchPosition(expected, actual, 0, ignoreCase);

        Write(Pfx_Expected);
        WriteExpectedValue(expected);
        if (ignoreCase)
            WriteModifier("ignoring case");
        WriteLine();
        WriteActualLine(actual);
        //DisplayDifferences(expected, actual);
        if (mismatch >= 0)
            WriteCaretLine(mismatchInCharPlaceholders);

    }

    // Copied due to private
    /// <summary>
    /// Write the generic 'Actual' line for a constraint
    /// </summary>
    /// <param name="constraint">The constraint for which the actual value is to be written</param>
    private void WriteActualLine(Constraint constraint)
    {
        Write(Pfx_Actual);
        constraint.WriteActualValueTo(this);
        WriteLine();
    }

    // Copied due to private
    /// <summary>
    /// Write the generic 'Actual' line for a given value
    /// </summary>
    /// <param name="actual">The actual value causing a failure</param>
    private void WriteActualLine(object actual)
    {
        Write(Pfx_Actual);
        WriteActualValue(actual);
        WriteLine();
    }

    // Copied due to private
    private void WriteCaretLine(int mismatch)
    {
        // We subtract 2 for the initial 2 blanks and add back 1 for the initial quote
        WriteLine("  {0}^", new string('-', PrefixLength + mismatch - 2 + 1));
    }
}

// Step 2.
public static class MsgUtils2
{
    private static readonly string ELLIPSIS = "...";

    /// <summary>
    /// Almost a copy of the MsgUtils.ClipExpectedAndActual method
    /// </summary>
    /// <param name="expected"></param>
    /// <param name="actual"></param>
    /// <param name="maxDisplayLength"></param>
    /// <param name="mismatch"></param>
    public static void ClipExpectedAndActual(ref string expected, ref string actual, int maxDisplayLength, int mismatch)
    {
        // Case 1: Both strings fit on line
        int maxStringLength = Math.Max(expected.Length, actual.Length);
        if (maxStringLength <= maxDisplayLength)
            return;

        // Case 2: Assume that the tail of each string fits on line
        int clipLength = maxDisplayLength - ELLIPSIS.Length;
        int clipStart = maxStringLength - clipLength;

        // Case 3: If it doesn't, center the mismatch position
        if (clipStart > mismatch)
            clipStart = Math.Max(0, mismatch - clipLength / 2);

        // Shift clipStart and maxDisplayLength back if they would split a non-spacing mark from its base character
        AdjustForNonPlaceholdingCharacter(expected, ref clipStart);
        AdjustForNonPlaceholdingCharacter(expected, ref maxDisplayLength);

        expected = MsgUtils.ClipString(expected, maxDisplayLength, clipStart);
        actual = MsgUtils.ClipString(actual, maxDisplayLength, clipStart);
    }

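    private static void AdjustForNonPlaceholdingCharacter(string expected, ref int index)
    {
        // Step back while the index points at a non-spacing mark, so that a mark is never
        // separated from its base character. The upper bounds check avoids an
        // IndexOutOfRangeException when the index lies at or beyond the end of the string.
        while (index > 0 && index < expected.Length
               && CharUnicodeInfo.GetUnicodeCategory(expected[index]) == UnicodeCategory.NonSpacingMark)
        {
            index--;
        }
    }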

    public static int FindMismatchPosition(string expected, string actual, int istart, bool ignoreCase)
    {
        int length = Math.Min(expected.Length, actual.Length);

        string s1 = ignoreCase ? expected.ToLower() : expected;
        string s2 = ignoreCase ? actual.ToLower() : actual;

        int iSpacingCharacters = 0;
        for (int i = 0; i < istart; i++)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(s1[i]) != UnicodeCategory.NonSpacingMark)
                iSpacingCharacters++;
        }
        for (int i = istart; i < length; i++)
        {
            if (s1[i] != s2[i])
                return iSpacingCharacters;
            if (CharUnicodeInfo.GetUnicodeCategory(s1[i]) != UnicodeCategory.NonSpacingMark)
                iSpacingCharacters++;
        }

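        //
        // Strings have same content up to the length of the shorter string.
        // Mismatch occurs because string lengths are different, so show
        // that they start differing where the shortest string ends.
        // Return the placeholder count rather than the char length, so that
        // this case uses the same units as the mismatch case above.
        //
        if (expected.Length != actual.Length)
            return iSpacingCharacters;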

        //
        // Same strings : We shouldn't get here
        //
        return -1;
    }
}

// Step 3.
public class ThaiEqualConstraint : EqualConstraint
{
    private readonly string _expected;

    // The base class keeps its 'expected' field private, so we store our own copy
    public ThaiEqualConstraint(string expected) : base(expected)
    {
        _expected = expected;
    }

    public override void WriteMessageTo(MessageWriter writer)
    {
        // Redirect output to the customized MessageWriter, then copy its text into the original writer
        var myMessageWriter = new ThaiMessageWriter();
        base.WriteMessageTo(myMessageWriter);
        writer.Write(myMessageWriter);
    }
}

// Step 4.
public static class ThaiText
{
    public static EqualConstraint IsEqual(string expected)
    {
        return new ThaiEqualConstraint(expected);
    }
}

score 0 · Answer 3 · edited May 23 '17 at 09:59

You should be able to use the code from this answer to convert each string to an escaped version of the original string. Composite characters will become a single escaped \u codepoint, while combining characters will be a series of such escapes. Then run your Assert on these escaped versions of the string.
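
The linked code is not reproduced here. As a rough stand-in, a hypothetical helper along these lines would produce such escaped strings; it assumes that every non-ASCII UTF-16 code unit is written as \uXXXX and ASCII is left as-is.

```csharp
using System.Text;

public static class EscapeDemo
{
    // Hypothetical helper: writes every non-ASCII char as a \uXXXX escape.
    public static string EscapeNonAscii(string s)
    {
        var builder = new StringBuilder();
        foreach (char c in s)
        {
            if (c < 0x80)
                builder.Append(c);
            else
                builder.Append("\\u").Append(((int)c).ToString("X4"));
        }
        return builder.ToString();
    }
}
```

A precomposed "ü" comes out as the single escape \u00FC, while "u" followed by a combining diaeresis comes out as u\u0308, so the two spellings become visibly different in the assertion output.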

edited May 23 '17 at 09:59

Community

answered Feb 28 '12 at 21:11

beerbajay

Unfortunately, this is not an option, either. An arrow _correctly_ pointing into the middle of a hex dump would be even more difficult to interpret than a _misplaced_ arrow pointing at the original text... – Be Brave Be Like Ukraine Feb 28 '12 at 23:28
