Matching Unicode characters in a regular expression

Question

I retrieve strings from a website using the HttpClient class. The webserver sends them in UTF-8 encoding. The strings have the form abc | a and I'd like to remove the pipe, the space and the character after the space from them, if they are at the end of the string.

sText = Regex.Replace (sText, @"\| .$", "");

works as expected. Now, in some cases, the pipe and the space are followed by another character, for example a smiley, so the string has the form abc | plus that smiley. The regular expression above does not work and I have to use

sText = Regex.Replace (sText, @"\| ..$", "");

instead (two dots).

I'm quite sure it has something to do with the encoding and with the fact that the smiley uses more bytes in UTF-8 than a Latin character, and the fact that C# doesn't know the encoding. The smiley is just one character, even if it uses more bytes, so after telling C# the correct encoding (or converting the string), the first regular expression should work in both cases.

How can this be done?

There is quite a problem with matching emojis with regex in .NET, as there is no `\p{Emoji}` construct. All you can do is define the [regex for any emoji](https://stackoverflow.com/a/48148218/3832970) or any byte (`.`). Or, you may work around it if you know what kind of chars do not appear in the string and use that to build the end of string pattern. — Wiktor Stribiżew, Aug 13 '21 at 19:34
Wiktor @Magnetron is (almost) right in his (unfairly downvoted) deleted answer. `Regex.Replace(sText, @"\| (\p{Cs}{2}|.)$", "");` should work as internal encoding in `.NET` is `UTF-16` and all chars above BMP are always two surrogates… — JosefZ, Aug 13 '21 at 20:04
The smiley was just an example. I'd like to remove everything that _looks_ like one item (one character, one digit, one symbol, ..). \p{Cs}{2} is probably too limited. — André, Aug 14 '21 at 17:51

score 1 · Answer 1 · answered Aug 22 '21 at 19:45

As suggested in the comments, this problem is hard to solve with a regex alone. What you call "looks like one item" is actually a grapheme cluster. The corresponding .NET term is a "text element", which can be parsed and iterated over using StringInfo.GetTextElementEnumerator.
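
To make the term concrete, here is a minimal sketch showing the difference between UTF-16 code units (what string.Length and the regex dot count) and text elements. The class and method names (TextElementDemo, CountTextElements) are made up for illustration; only StringInfo.GetTextElementEnumerator comes from the framework.

```csharp
using System;
using System.Globalization;

static class TextElementDemo
{
    // Hypothetical helper: counts text elements (grapheme clusters)
    // instead of UTF-16 code units.
    public static int CountTextElements(string s)
    {
        var count = 0;
        var enumerator = StringInfo.GetTextElementEnumerator(s);
        while (enumerator.MoveNext())
        {
            count++;
        }
        return count;
    }

    static void Main()
    {
        var smiley = "\uD83D\uDE42"; // U+1F642, stored as one surrogate pair

        Console.WriteLine(smiley.Length);                 // 2 (UTF-16 code units)
        Console.WriteLine(CountTextElements(smiley));     // 1 (text element)
        Console.WriteLine(CountTextElements("e\u0301"));  // 1 (base letter + combining accent)
    }
}
```

This is why the single-dot pattern fails on the smiley: the dot consumes one UTF-16 code unit, and the smiley occupies two.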

A possible solution based on text elements can be quite simple: we just need to extract the last three text elements from the input string and check that the first is a pipe and the second is a space; the third can be anything. An implementation of this approach follows.

using System;
using System.Globalization;

static class Program
{
    static void Main()
    {
        var inputs = new[] {
            "abc | a",
            "abc | ab", // The only one that shouldn't be trimmed
            "abc | \uD83D\uDE42", // a smiley, U+1F642 as one surrogate pair
            "abc | " + "\uD83D\uDD75\u200D\u2642\uFE0F" // "man-detective" (on Windows)
        };

        foreach (var input in inputs)
        {
            var res = TrimTrailingTextElement(input);

            Console.WriteLine("Input : " + input);
            Console.WriteLine("Result: " + res);
            Console.WriteLine();
        }
    }

    static string TrimTrailingTextElement(string input)
    {
        // A circular buffer for storing the start indexes of the last 3 text elements
        var lastThreeElementIdxs = new int[3] { -1, -1, -1 };

        // Get an enumerator of text elements in the input string.
        // Note: .NET 5 and later treat a whole emoji ZWJ sequence as one text
        // element; older runtimes split such a sequence into several elements.
        var enumerator = StringInfo.GetTextElementEnumerator(input);

        // Iterate through the entire input string,
        // at each step save the current element index to the buffer
        var i = -1;
        while (enumerator.MoveNext())
        {
            i = (i + 1) % 3;
            lastThreeElementIdxs[i] = enumerator.ElementIndex;
        }

        // The buffer index is non-negative for a non-empty input
        if (i >= 0)
        {
            // Extract indexes of the last 3 elements
            // from the circular buffer (oldest first)
            var i1 = lastThreeElementIdxs[(i + 1) % 3];
            var i2 = lastThreeElementIdxs[(i + 2) % 3];
            var i3 = lastThreeElementIdxs[i];

            if (i1 >= 0 && i2 >= 0 && i3 >= 0 && // All 3 indexes must be initialized
                i3 - i2 == 1 && i2 - i1 == 1 &&  // The 1st and 2nd elements must be 1 char long
                input[i1] == '|' &&              // The 1st element must be a pipe
                input[i2] == ' ')                // The 2nd element must be a space
            {
                // Keep everything before the pipe
                return input.Substring(0, i1);
            }
        }

        return input;
    }
}
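
For comparison, the regex proposed in the comments can be sketched as below. The pattern is the one from JosefZ's comment; the wrapper names (SurrogatePairRegexDemo, TrimWithRegex) are made up for illustration. It handles a single surrogate pair, but as far as I can tell it cannot cover longer clusters such as the "man-detective" sequence, which is why the text element approach is more general.

```csharp
using System;
using System.Text.RegularExpressions;

static class SurrogatePairRegexDemo
{
    // Pattern from the comments: a pipe, a space, then either exactly one
    // surrogate pair or exactly one other UTF-16 code unit, at the end.
    public static string TrimWithRegex(string input) =>
        Regex.Replace(input, @"\| (\p{Cs}{2}|.)$", "");

    static void Main()
    {
        // Trimmed: a single BMP character after the space
        Console.WriteLine(TrimWithRegex("abc | a") == "abc ");

        // Trimmed: one emoji stored as a single surrogate pair
        Console.WriteLine(TrimWithRegex("abc | \uD83D\uDE42") == "abc ");

        // Not trimmed: the ZWJ sequence is 5 code units, so neither
        // alternative can reach the end of the string
        var detective = "abc | \uD83D\uDD75\u200D\u2642\uFE0F";
        Console.WriteLine(TrimWithRegex(detective) == detective);
    }
}
```

All three comparisons should print True, the last one showing the case the regex misses.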
