Unicode character range not being consumed by Regex

Question

NOTE

A similar question, "C# Regular Expressions with \Uxxxxxxxx characters in the pattern", has already been asked. This question differs in that it is not about how surrogate pairs are calculated, but about how to express Unicode planes higher than 0 in a regex. It should be clear from my question that I already understand why these code points are expressed as 2 characters: they are surrogate pairs (which is what the other question asks about). My question is how I can convert them generically (since I have no control over what the regex fed to the program looks like) so that the .NET Regex engine can consume them.

Note I now have a way to do this and would like to add my answer to my question, but since this is now marked as a duplicate I cannot add my answer.

I have some test data that is passed to a Java library that I am porting to C#. I have isolated a specific problem case as an example. The character class in the original is written with UTF-32 escapes, \U0001BCA0-\U0001BCA3, which .NET cannot consume directly: we get an "Unrecognized escape sequence \U" error.

I attempted to convert to UTF-16 and I have confirmed the results for \U0001BCA0 and \U0001BCA3 are what should be expected.

UTF-32      | Codepoint   | High Surrogate  | Low Surrogate  | UTF-16
---------------------------------------------------------------------------
0x0001BCA0  | 113824      | 55343           | 56480          | \uD82F\uDCA0
0x0001BCA3  | 113827      | 55343           | 56483          | \uD82F\uDCA3
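
The arithmetic behind the table can be sketched in a few lines. This is only an illustration of the standard UTF-16 surrogate calculation; the class and method names (`SurrogateMath`, `ToSurrogates`) are made up for this example, and in real code `char.ConvertFromUtf32` does the same job.

```csharp
using System;

public static class SurrogateMath
{
    // Hypothetical helper: splits a supplementary code point
    // (U+10000..U+10FFFF) into its UTF-16 high and low surrogates.
    public static (char High, char Low) ToSurrogates(int codePoint)
    {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF)
            throw new ArgumentOutOfRangeException(nameof(codePoint));

        int offset = codePoint - 0x10000;              // 20-bit value
        char high = (char)(0xD800 + (offset >> 10));   // top 10 bits
        char low = (char)(0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return (high, low);
    }
}
```

Calling `SurrogateMath.ToSurrogates(0x1BCA0)` should give the pair \uD82F, \uDCA0 from the first table row.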

However, when I pass the string "([\uD82F\uDCA0-\uD82F\uDCA3])" to the constructor of the Regex class, I get an exception "[x-y] range in reverse order".

Although it is pretty clear the characters are specified in the right order (it works in Java), I tried in reverse and got the same error message.

I also tried changing the UTF-32 characters from \U0001BCA0-\U0001BCA3 to \x01BCA0-\x01BCA3, but still get the exception "[x-y] range in reverse order".

So, how do I get the .NET Regex class to parse this character range successfully?

NOTE: I tried changing the code to generate a regex character class that includes all of the characters instead of a range and it seems to work, but that is going to turn my regexes that are a few dozen characters into several thousand characters, which surely isn't going to do wonders for performance.

Actual Regex Example

Again, the above is an isolated example of a failure in a much larger string. What I am looking for is a general way to convert regexes like these so they can be parsed by the .NET Regex class.

"([\\u0000-\\u0009\\u000B\\u000C\\u000E-\\u001F\\u007F-\\u009F\\u00AD" +
"\\u061C\\u180E\\u200B\\u200E\\u200F\\u2028-\\u202E\\u2060-\\u206F\\uD800-" +
"\\uDFFF\\uFEFF\\uFFF0-\\uFFFB\\U0001BCA0-\\U0001BCA3\\U0001D173-" +
"\\U0001D17A\\U000E0000-\\U000E001F\\U000E0080-\\U000E00FF\\U000E01F0-\\U000E0FFF] " +
"| [\\u000D] | [\\u000A]) ()"

This is absolutely not a duplicate. The question is not how surrogate pairs are calculated, but how to express unicode planes higher than 0 in a regex. — Sefe, Dec 02 '17 at 11:50
@WiktorStribiżew - please reopen my question so I can add my answer to it. This is not a duplicate of the linked question. — NightOwl888, Dec 03 '17 at 13:48

Sefe · Answer 1 · 2017-12-02T07:30:37.910

You assume that Regex will recognize "\uD82F\uDCA0" as a single compound character. It does not, because .NET strings are internally UTF-16 and the regex engine works on individual 16-bit code units.

Unicode defines code points, an abstract concept that is independent of the physical representation. Depending on the encoding, not every code point fits in a single code unit. In UTF-8 this is very obvious, since every code point above 127 needs two or more bytes. .NET strings are UTF-16, so code points in planes higher than 0 need a surrogate pair, that is, two chars. The regex engine still treats the two halves of the pair as individual characters.
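
To make the difference between code units and code points concrete, here is a small sketch. The class and method names are hypothetical; only `string.Length` and `char.IsSurrogatePair` are real .NET APIs.

```csharp
using System;

public static class CodeUnitDemo
{
    // Number of UTF-16 code units, which is what .NET calls chars.
    public static int CountCodeUnits(string s) => s.Length;

    // Number of Unicode code points, counting each surrogate pair once.
    public static int CountCodePoints(string s)
    {
        int count = 0;
        for (int i = 0; i < s.Length; i++)
        {
            if (char.IsSurrogatePair(s, i))
                i++; // skip the low surrogate of the pair
            count++;
        }
        return count;
    }
}
```

For "\uD82F\uDCA0" the first method reports 2 and the second reports 1, which is the mismatch the regex engine trips over.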

Long story short: don't treat character combinations as code points, treat them as individual characters. So in your case the regex would be:

using System;
using System.Text.RegularExpressions;

public class Program
{
    public static void Main()
    {
        var regex = new Regex("(\uD82F[\uDCA0-\uDCA3])");
        Console.WriteLine(regex.Match("\uD82F\uDCA2").Success);
    }
}

You can try out the code here.

What would I need to do for ranges that don't happen to specify the same high surrogate character? Again, the example is an isolated case. My actual strings have character classes with many codepoint ranges specified in them. — NightOwl888, Dec 02 '17 at 08:37
If you need to do this in Regex, you have to split up your ranges to sub ranges and use `(range1|range2)`. If you are open to a non-regex solution, you could convert this into binary with `Encoding.UTF32` and search in the binary for the code points. Note that for the Regex solution you need maximum 3 sub-ranges per code point range. — Sefe, Dec 02 '17 at 09:18

CodeFuller · Answer 2 · 2017-12-02T06:28:44.483


Strings in C# are UTF-16 encoded. That's why this regex is treated as:

Symbol '\uD82F' or
Range \uDCA0-\uD82F or
Symbol '\uDCA3'

The range \uDCA0-\uD82F is obviously invalid, since its start is greater than its end, and causes the "[x-y] range in reverse order" exception.
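
The behaviour described above can be checked with a small sketch. The helper name `IsRejected` is invented for this example; it relies only on the `Regex` constructor throwing an `ArgumentException` (or a subclass) for an invalid pattern.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RangeOrderDemo
{
    // Hypothetical helper: returns true if the Regex constructor
    // rejects the pattern as invalid.
    public static bool IsRejected(string pattern)
    {
        try
        {
            new Regex(pattern);
            return false;
        }
        catch (ArgumentException)
        {
            return true;
        }
    }
}
```

With this, `IsRejected("[\uD82F\uDCA0-\uD82F\uDCA3]")` should be true, because the class contains the reversed range \uDCA0-\uD82F, while `IsRejected("\uD82F[\uDCA0-\uDCA3]")` should be false, because the range only covers low surrogates.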

Unfortunately there is no easy solution for your problem because it is caused by the nature of C# strings. You can't fit a UTF-32 symbol into one C# character, and you can't use multi-character strings as range borders.

A possible workaround is a semi-regex solution: extract such symbols from the string and do the comparison in plain C# code. Of course it seems ugly, but I don't see another way to accomplish this with raw regex in C#.
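
One way the plain C# comparison might look is sketched below. This is an assumption about what such a workaround could be, not code from the answer; the class and method names are hypothetical, and only `char.IsSurrogatePair` and `char.ConvertToUtf32` are real .NET APIs.

```csharp
using System;

public static class CodePointRanges
{
    // Hypothetical helper: returns the char index of the first code point
    // of s that lies in [first, last] (inclusive), or -1 if there is none.
    public static int IndexOfCodePointInRange(string s, int first, int last)
    {
        for (int i = 0; i < s.Length; i++)
        {
            bool isPair = char.IsSurrogatePair(s, i);
            // Decode a surrogate pair to its code point; otherwise use the char value.
            int codePoint = isPair ? char.ConvertToUtf32(s, i) : s[i];
            if (codePoint >= first && codePoint <= last)
                return i;
            if (isPair)
                i++; // skip the low surrogate
        }
        return -1;
    }
}
```

For example, `IndexOfCodePointInRange("ab\uD82F\uDCA2c", 0x1BCA0, 0x1BCA3)` finds the pair at index 2.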

edited Dec 02 '17 at 06:28

answered Dec 02 '17 at 06:22


Thanks. At least now I have a reasonable explanation as to why it is happening. However, the entire gist of the code is that it uses a set of file-driven rules to build a regex and then it compares that regex against the production code to ensure it is working the same way. I am going to have to ponder how to best handle this. – NightOwl888 Dec 02 '17 at 06:41

score 1 · Accepted Answer · answered Dec 04 '17 at 05:39

While the other contributors to this question provided some clues, I needed an answer. My test is a rules engine that is driven by a regex that is built up from file input, so hard coding the logic into C# is not an option.

However, I did learn here that

the .NET Regex class does not support surrogate pairs and
you can fake support for surrogate pair ranges by using regex alternation

But of course, in my data-driven case I can't manually change the regexes to a format that .NET will accept - I need to automate it. So, I created the below Utf32Regex class that accepts UTF32 characters directly in the constructor and internally converts them to regexes that .NET understands.

For example, it will convert the regex

"[abc\\U00011DEF-\\U00013E07]"

To

"(?:[abc]|\\uD807[\\uDDEF-\\uDFFF]|[\\uD808-\\uD80E][\\uDC00-\\uDFFF]|\\uD80F[\\uDC00-\\uDE07])"
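
As a sanity check, the converted pattern above can be tested against individual code points. The wrapper class below is hypothetical and only adds ^ and $ anchors around the converted pattern so that each test string must match in full.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ConvertedPatternCheck
{
    // The converted pattern from the example above, anchored at both ends.
    private static readonly Regex Converted = new Regex(
        "^(?:[abc]|\\uD807[\\uDDEF-\\uDFFF]|[\\uD808-\\uD80E][\\uDC00-\\uDFFF]|\\uD80F[\\uDC00-\\uDE07])$");

    // True if the single code point is matched by the converted pattern.
    public static bool Matches(int codePoint) =>
        Converted.IsMatch(char.ConvertFromUtf32(codePoint));
}
```

Both end points (0x11DEF and 0x13E07) and a middle value such as 0x12000 should match, while 0x11DEE and 0x13E08, which sit just outside the range, should not.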

Or

"([\\u0000-\\u0009\\u000B\\u000C\\u000E-\\u001F\\u007F-\\u009F\\u00AD" +
"\\u061C\\u180E\\u200B\\u200E\\u200F\\u2028-\\u202E\\u2060-\\u206F\\uD800-" +
"\\uDFFF\\uFEFF\\uFFF0-\\uFFFB\\U0001BCA0-\\U0001BCA3\\U0001D173-" +
"\\U0001D17A\\U000E0000-\\U000E001F\\U000E0080-\\U000E00FF\\U000E01F0-\\U000E0FFF] " +
"| [\\u000D] | [\\u000A]) ()"

To

"((?:[\\u0000-\\u0009\\u000B\\u000C\\u000E-\\u001F\\u007F-\\u009F\\u00AD\\u061C\\u180E" + 
"\\u200B\\u200E\\u200F\\u2028-\\u202E\\u2060-\\u206F\\uD800-\\uDFFF\\uFEFF\\uFFF0-\\uFFFB]|" + 
"\\uD82F[\\uDCA0-\\uDCA3]|\\uD834[\\uDD73-\\uDD7A]|\\uDB40[\\uDC00-\\uDC1F]|" + 
"\\uDB40[\\uDC80-\\uDCFF]|\\uDB40[\\uDDF0-\\uDFFF]|[\\uDB41-\\uDB42][\\uDC00-\\uDFFF]|" + 
"\\uDB43[\\uDC00-\\uDFFF]) | [\\u000D] | [\\u000A]) ()"

Utf32Regex.cs

using System;
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

/// <summary>
/// Patches the <see cref="Regex"/> class so it will automatically convert and interpret
/// UTF32 characters expressed like <c>\U00010000</c> or UTF32 ranges expressed
/// like <c>\U00010000-\U00010001</c>.
/// </summary>
public class Utf32Regex : Regex
{
    private const char MinLowSurrogate = '\uDC00';
    private const char MaxLowSurrogate = '\uDFFF';

    private const char MinHighSurrogate = '\uD800';
    private const char MaxHighSurrogate = '\uDBFF';

    // Match any character class such as [A-z]
    private static readonly Regex characterClass = new Regex(
        "(?<!\\\\)(\\[.*?(?<!\\\\)\\])",
        RegexOptions.Compiled);

    // Match a UTF32 range such as \U000E01F0-\U000E0FFF
    // or an individual character such as \U000E0FFF
    private static readonly Regex utf32Range = new Regex(
        "(?<begin>\\\\U(?:00)?[0-9A-Fa-f]{6})-(?<end>\\\\U(?:00)?[0-9A-Fa-f]{6})|(?<begin>\\\\U(?:00)?[0-9A-Fa-f]{6})",
        RegexOptions.Compiled);

    public Utf32Regex()
        : base()
    {
    }

    public Utf32Regex(string pattern)
        : base(ConvertUTF32Characters(pattern))
    {
    }

    public Utf32Regex(string pattern, RegexOptions options)
        : base(ConvertUTF32Characters(pattern), options)
    {
    }

    public Utf32Regex(string pattern, RegexOptions options, TimeSpan matchTimeout)
        : base(ConvertUTF32Characters(pattern), options, matchTimeout)
    {
    }

    private static string ConvertUTF32Characters(string regexString)
    {
        StringBuilder result = new StringBuilder();
        // Convert any UTF32 character ranges \U00000000-\U0010FFFF to their
        // equivalent UTF16 characters
        ConvertUTF32CharacterClassesToUTF16Characters(regexString, result);
        // Now find all of the individual characters that were not in ranges and
        // fix those as well.
        ConvertUTF32CharactersToUTF16(result);

        return result.ToString();
    }

    private static void ConvertUTF32CharacterClassesToUTF16Characters(string regexString, StringBuilder result)
    {
        Match match = characterClass.Match(regexString); // Reset
        int lastEnd = 0;
        if (match.Success)
        {
            do
            {
                string characterClass = match.Groups[1].Value;
                string convertedCharacterClass = ConvertUTF32CharacterRangesToUTF16Characters(characterClass);

                result.Append(regexString.Substring(lastEnd, match.Index - lastEnd)); // Remove the match
                result.Append(convertedCharacterClass); // Append replacement 

                lastEnd = match.Index + match.Length;
            } while ((match = match.NextMatch()).Success);
        }
        result.Append(regexString.Substring(lastEnd)); // Append tail
    }

    private static string ConvertUTF32CharacterRangesToUTF16Characters(string characterClass)
    {
        StringBuilder result = new StringBuilder();
        StringBuilder chars = new StringBuilder();

        Match match = utf32Range.Match(characterClass); // Reset
        int lastEnd = 0;
        if (match.Success)
        {
            do
            {
                string utf16Chars;
                string rangeBegin = match.Groups["begin"].Value.Substring(2);

                if (!string.IsNullOrEmpty(match.Groups["end"].Value))
                {
                    string rangeEnd = match.Groups["end"].Value.Substring(2);
                    utf16Chars = UTF32RangeToUTF16Chars(rangeBegin, rangeEnd);
                }
                else
                {
                    utf16Chars = UTF32ToUTF16Chars(rangeBegin);
                }

                result.Append(characterClass.Substring(lastEnd, match.Index - lastEnd)); // Remove the match
                chars.Append(utf16Chars); // Append replacement 

                lastEnd = match.Index + match.Length;
            } while ((match = match.NextMatch()).Success);
        }
        result.Append(characterClass.Substring(lastEnd)); // Append tail of character class

        // Special case - if we have removed all of the contents of the
        // character class, we need to remove the square brackets and the
        // alternation character |
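        int emptyCharClass = result.IndexOf("[]");
        // Only strip the brackets when there are replacement ranges to put in their place
        if (emptyCharClass >= 0 && chars.Length > 0)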
        {
            result.Remove(emptyCharClass, 2);
            // Append replacement ranges (exclude beginning |)
            result.Append(chars.ToString(1, chars.Length - 1));
        }
        else
        {
            // Append replacement ranges
            result.Append(chars.ToString());
        }

        if (chars.Length > 0)
        {
            // Wrap both the character class and any UTF16 character alternation into
            // a non-capturing group.
            return "(?:" + result.ToString() + ")";
        }
        return result.ToString();
    }

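    private static void ConvertUTF32CharactersToUTF16(StringBuilder result)
    {
        while (true)
        {
            int where = result.IndexOf("\\U00");
            if (where < 0)
            {
                break;
            }
            int codePoint = int.Parse(result.ToString(where + 2, 8), NumberStyles.HexNumber, CultureInfo.InvariantCulture);
            // Outside a character class there is nothing to alternate with, so emit the
            // UTF16 code units in a non-capturing group instead of with a leading |.
            // The group also keeps a following quantifier applied to the whole pair.
            var cp = new StringBuilder("(?:");
            AppendUTF16CodePoint(cp, codePoint);
            cp.Append(')');
            result.Replace(where, where + 10, cp.ToString());
        }
    }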

    private static string UTF32RangeToUTF16Chars(string hexBegin, string hexEnd)
    {
        var result = new StringBuilder();
        int beginCodePoint = int.Parse(hexBegin, NumberStyles.HexNumber);
        int endCodePoint = int.Parse(hexEnd, NumberStyles.HexNumber);

        var beginChars = char.ConvertFromUtf32(beginCodePoint);
        var endChars = char.ConvertFromUtf32(endCodePoint);
        int beginDiff = endChars[0] - beginChars[0];

        if (beginDiff == 0)
        {
            // If the begin character is the same, we can just use the syntax \uD807[\uDDEF-\uDFFF]
            result.Append("|");
            AppendUTF16Character(result, beginChars[0]);
            result.Append('[');
            AppendUTF16Character(result, beginChars[1]);
            result.Append('-');
            AppendUTF16Character(result, endChars[1]);
            result.Append(']');
        }
        else
        {
            // If the begin character is not the same, create 3 ranges
            // 1. The remainder of the first
            // 2. A range of all of the middle characters
            // 3. The beginning of the last

            result.Append("|");
            AppendUTF16Character(result, beginChars[0]);
            result.Append('[');
            AppendUTF16Character(result, beginChars[1]);
            result.Append('-');
            AppendUTF16Character(result, MaxLowSurrogate);
            result.Append(']');

            // We only need a middle range if the ranges are not adjacent
            if (beginDiff > 1)
            {
                result.Append("|");
                // We only need a character class if there are more than 1
                // characters in the middle range
                if (beginDiff > 2)
                {
                    result.Append('[');
                }
                AppendUTF16Character(result, (char)(Math.Min(beginChars[0] + 1, MaxHighSurrogate)));
                if (beginDiff > 2)
                {
                    result.Append('-');
                    AppendUTF16Character(result, (char)(Math.Max(endChars[0] - 1, MinHighSurrogate)));
                    result.Append(']');
                }
                result.Append('[');
                AppendUTF16Character(result, MinLowSurrogate);
                result.Append('-');
                AppendUTF16Character(result, MaxLowSurrogate);
                result.Append(']');
            }

            result.Append("|");
            AppendUTF16Character(result, endChars[0]);
            result.Append('[');
            AppendUTF16Character(result, MinLowSurrogate);
            result.Append('-');
            AppendUTF16Character(result, endChars[1]);
            result.Append(']');
        }
        return result.ToString();
    }

    private static string UTF32ToUTF16Chars(string hex)
    {
        int codePoint = int.Parse(hex, NumberStyles.HexNumber, CultureInfo.InvariantCulture);
        return UTF32ToUTF16Chars(codePoint);
    }

    private static string UTF32ToUTF16Chars(int codePoint)
    {
        StringBuilder result = new StringBuilder();
        UTF32ToUTF16Chars(codePoint, result);
        return result.ToString();
    }

    private static void UTF32ToUTF16Chars(int codePoint, StringBuilder result)
    {
        // Prefix the code point with the regex alternation operator so it can be
        // appended after the character class as an alternative of its own.
        result.Append("|");
        AppendUTF16CodePoint(result, codePoint);
    }

    private static void AppendUTF16CodePoint(StringBuilder text, int cp)
    {
        var chars = char.ConvertFromUtf32(cp);
        AppendUTF16Character(text, chars[0]);
        if (chars.Length == 2)
        {
            AppendUTF16Character(text, chars[1]);
        }
    }

    private static void AppendUTF16Character(StringBuilder text, char c)
    {
        text.Append(@"\u");
        // Always emit 4 hex digits; the \u escape requires exactly 4.
        text.Append(((int)c).ToString("X4", CultureInfo.InvariantCulture));
    }
}

StringBuilderExtensions.cs

using System;
using System.Text;

public static class StringBuilderExtensions
{
    /// <summary>
    /// Searches for the first index of the specified string. The search for
    /// the string starts at the beginning and moves towards the end.
    /// </summary>
    /// <param name="text">This <see cref="StringBuilder"/>.</param>
    /// <param name="value">The string to find.</param>
    /// <returns>The index of the specified string, or -1 if the string isn't found.</returns>
    public static int IndexOf(this StringBuilder text, string value)
    {
        return IndexOf(text, value, 0);
    }

    /// <summary>
    /// Searches for the index of the specified string. The search for the
    /// string starts at the specified offset and moves towards the end.
    /// </summary>
    /// <param name="text">This <see cref="StringBuilder"/>.</param>
    /// <param name="value">The string to find.</param>
    /// <param name="startIndex">The starting offset.</param>
    /// <returns>The index of the specified string, or -1 if the string isn't found.</returns>
    public static int IndexOf(this StringBuilder text, string value, int startIndex)
    {
        if (text == null)
            throw new ArgumentNullException("text");
        if (value == null)
            throw new ArgumentNullException("value");

        int index;
        int length = value.Length;
        int maxSearchLength = (text.Length - length) + 1;

        for (int i = startIndex; i < maxSearchLength; ++i)
        {
            if (text[i] == value[0])
            {
                index = 1;
                while ((index < length) && (text[i + index] == value[index]))
                    ++index;

                if (index == length)
                    return i;
            }
        }

        return -1;
    }

    /// <summary>
    /// Replaces the specified subsequence in this builder with the specified
    /// string.
    /// </summary>
    /// <param name="text">this builder.</param>
    /// <param name="start">the inclusive begin index.</param>
    /// <param name="end">the exclusive end index.</param>
    /// <param name="str">the replacement string.</param>
    /// <returns>this builder.</returns>
    /// <exception cref="IndexOutOfRangeException">
    /// if <paramref name="start"/> is negative, greater than the current
    /// <see cref="StringBuilder.Length"/> or greater than <paramref name="end"/>.
    /// </exception>
    /// <exception cref="ArgumentNullException">if <paramref name="str"/> is <c>null</c>.</exception>
    public static StringBuilder Replace(this StringBuilder text, int start, int end, string str)
    {
        if (str == null)
        {
            throw new ArgumentNullException(nameof(str));
        }
        if (start >= 0)
        {
            if (end > text.Length)
            {
                end = text.Length;
            }
            if (end > start)
            {
                int stringLength = str.Length;
                int diff = end - start - stringLength;
                if (diff > 0)
                { // replacing with fewer characters
                    text.Remove(start, diff);
                }
                else if (diff < 0)
                {
                    // replacing with more characters...need some room
                    text.Insert(start, new char[-diff]);
                }
                // copy the chars based on the new length
                for (int i = 0; i < stringLength; i++)
                {
                    text[i + start] = str[i];
                }
                return text;
            }
            if (start == end)
            {

                text.Insert(start, str);
                return text;
            }
        }
        throw new IndexOutOfRangeException();
    }
}

Do note this is not very well tested and probably not very robust, but for testing purposes it should be fine.
