remove 4 byte UTF8 characters

Question

I'd like to remove 4 byte UTF8 characters which start with \xF0 (the char with the ASCII code 0xF0) from a string and tried

sText = Regex.Replace (sText, "\xF0...", "");

This doesn't work. Using two backslashes did not work either.

The exact input is the content of https://de.wikipedia.org/w/index.php?title=Spezial:Exportieren&action=submit&pages=Unicode The 4 byte character is the one after the text "[[Violinschlüssel]] ", in hex notation: .. 0x65 0x6c 0x5d 0x5d 0x20 0xf0 0x9d 0x84 0x9e 0x20 .. The expected output is .. 0x65 0x6c 0x5d 0x5d 0x20 0x20 ..

What's wrong?

Maybe because you tried to remove [`ð` character](https://ideone.com/YizDeh). What is your exact input and exact expected output? — Wiktor Stribiżew, Aug 02 '16 at 07:51
Comments are for _us_ to ask _you_ for clarification. Please put your clarifications in the question itself, by clicking the [edit](https://stackoverflow.com/posts/38714663/edit) link and updating your post. — Peter Duniho, Aug 02 '16 at 08:02
This is a good question. It concerns the non-obvious relation between bytes, characters and strings in C#. — AdrianHHH, Aug 02 '16 at 08:21
The exact input is not clear, please put it into a string literal. For now, have a look at https://ideone.com/IDPqHP — Wiktor Stribiżew, Aug 02 '16 at 08:27
I'm curious, why are the Unicode codepoints that UTF-8 encodes to 4-bytes (U+10000 to U+10FFFF) special to you? And, why do you describe them in terms of bytes and not codepoint ranges or Unicode blocks (or—not quite the same thing—Unicode categories)? — Tom Blodget, Aug 02 '16 at 16:53
I need to store the data in a MySQL database with "UTF8" encoding (which can't be changed for the moment). Please see http://stackoverflow.com/questions/10957238/incorrect-string-value-when-trying-to-insert-utf-8-into-mysql-via-jdbc — André, Aug 02 '16 at 19:54

Jeppe Stig Nielsen · Accepted Answer · 2016-09-22T13:19:17.227

5

Such characters will be surrogate pairs in .NET which uses UTF-16. Each of them will be two UTF-16 code units, that is two char values.

To just remove them, you can do (using System.Linq;):

sText = string.Concat(sText.Where(x => !char.IsSurrogate(x)));

(uses an overload of Concat introduced in .NET 4.0 (Visual Studio 2010)).

Late addition: It may give better performance to use:

sText = new string(sText.Where(x => !char.IsSurrogate(x)).ToArray());

even if it looks worse. (Works in .NET 3.5 (Visual Studio 2008).)
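A minimal sketch of the same idea as a complete program, in case it helps to try it out. The class name `SurrogateStripper` and the loop-based helper are hypothetical names chosen for illustration. Note that `char.IsSurrogate` also drops unpaired surrogates, not only well-formed pairs.

```csharp
using System;
using System.Linq;
using System.Text;

public static class SurrogateStripper
{
    // The LINQ one-liner from above: drop every UTF-16 surrogate code unit.
    public static string RemoveSurrogatesLinq(string text)
    {
        return string.Concat(text.Where(c => !char.IsSurrogate(c)));
    }

    // Loop-based variant (hypothetical helper) that avoids LINQ.
    public static string RemoveSurrogatesLoop(string text)
    {
        var sb = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            if (!char.IsSurrogate(c))
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // "\U0001D11E" is U+1D11E, the 4-byte character from the question.
        string input = "el]] \U0001D11E ";
        string output = RemoveSurrogatesLoop(input);

        // The surrogate pair counts as two char values, so the length drops by 2.
        Console.WriteLine("{0} -> {1}", input.Length, output.Length); // 8 -> 6
    }
}
```

Both helpers should return the same string; which one is faster would have to be measured for the actual input.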

edited Sep 22 '16 at 13:19

answered Aug 02 '16 at 09:15

Jeppe Stig Nielsen


As far as I understand it removes all 3 and 4 byte UTF8 characters (which are 2 UTF16 char values in C# strings). This is not exactly what I asked for, but I found out that this is exactly what I really require. Thanks again. – André Aug 02 '16 at 09:45
@André You are wrong. If you want to remove characters that correspond to 3 byte UTF-8 or longer, just use `sText = string.Concat(sText.Where(x => x < '\u0800'));`. UTF-8 may be used in files, but it is not used by .NET or Windows once the `string` is in memory. If a character requires 1, 2, or 3 bytes in UTF-8, it can fit in one single _code unit_ (that is one single `char` value) in UTF-16 which is the encoding used internally by .NET and Windows. If a character requires 4 bytes in UTF-8, it needs two UTF-16 _code units_ (so _two_ `char` values); these two make up a "surrogate pair". – Jeppe Stig Nielsen Aug 02 '16 at 10:01
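The relationship described in the comment above can be checked directly. This is only a small sketch; the class name `Utf8VersusUtf16` and the sample characters are my own choices, not from the thread.

```csharp
using System;
using System.Text;

public static class Utf8VersusUtf16
{
    // Number of bytes the text occupies when encoded as UTF-8.
    public static int Utf8ByteCount(string text)
    {
        return Encoding.UTF8.GetByteCount(text);
    }

    public static void Main()
    {
        string[] samples =
        {
            "\u0041",      // U+0041, "A"
            "\u00FC",      // U+00FC, the "ü" in "Violinschlüssel"
            "\u20AC",      // U+20AC, euro sign
            "\U0001D11E"   // U+1D11E, the character from the question
        };

        foreach (string s in samples)
        {
            // s.Length counts UTF-16 code units (char values), not bytes.
            Console.WriteLine("UTF-8 bytes: {0}, UTF-16 chars: {1}",
                Utf8ByteCount(s), s.Length);
        }
        // Prints 1/1, 2/1, 3/1 and 4/2 for the four samples.
    }
}
```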

AdrianHHH · Answer 2 · 2016-08-02T09:26:57.503

You are trying to search for byte values but C# strings are made from char values. The C# language spec at section "2.4.4.4 Character literals" states:

A character literal represents a single character, and usually consists of a character in quotes, as in 'a'.
...
A hexadecimal escape sequence represents a single Unicode character, with the value formed by the hexadecimal number following \x.

Hence the search for "\xF0..." is searching for the character U+F0 which would be represented by the bytes C3 B0.

If you want to find and replace all Unicode characters whose first UTF-8 byte is 0xF0 then I believe you need to search for the character values whose first byte is 0xF0.

The character U+10000 is represented as F0 90 80 80 (the preceding code is U+FFFF which is EF BF BF). The first code with F1 .... .. is U+40000 which is F1 80 80 80 and the value before it is U+3FFFF which is F0 BF BF BF.
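The byte sequences quoted above can be reproduced with a short program. This is a sketch only; `Utf8Boundaries` and `Utf8Hex` are hypothetical names.

```csharp
using System;
using System.Text;

public static class Utf8Boundaries
{
    // UTF-8 bytes of a single code point, formatted like "F0-90-80-80".
    public static string Utf8Hex(int codePoint)
    {
        string s = char.ConvertFromUtf32(codePoint);
        return BitConverter.ToString(Encoding.UTF8.GetBytes(s));
    }

    public static void Main()
    {
        // U+00F0 is what "\xF0" means in a C# string; the rest are the range boundaries.
        int[] codePoints = { 0xF0, 0xFFFF, 0x10000, 0x3FFFF, 0x40000 };
        foreach (int cp in codePoints)
        {
            Console.WriteLine("U+{0:X4} -> {1}", cp, Utf8Hex(cp));
        }
        // U+00F0 -> C3-B0
        // U+FFFF -> EF-BF-BF
        // U+10000 -> F0-90-80-80
        // U+3FFFF -> F0-BF-BF-BF
        // U+40000 -> F1-80-80-80
    }
}
```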

Hence you need to remove characters in the range U+10000 to U+3FFFF. This should be possible with a regular expression such as

sText = Regex.Replace (sText, "[\\x10000-\\x3FFFF]", "");
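One caveat with this pattern: in .NET regular expressions `\x` takes exactly two hex digits, so `\x10000` is read as `\x10` followed by the literal text `000`, and the regex engine matches UTF-16 `char` values rather than whole code points. A possible workaround, sketched below, is to express the range as surrogate pairs: U+10000 to U+3FFFF corresponds to high surrogates D800 to D8BF followed by any low surrogate. The class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FourByteRegex
{
    // U+10000..U+3FFFF (UTF-8 lead byte F0): high surrogates D800..D8BF.
    private static readonly Regex LeadByteF0 =
        new Regex(@"[\uD800-\uD8BF][\uDC00-\uDFFF]");

    // Any supplementary character, i.e. every 4-byte UTF-8 sequence.
    private static readonly Regex AnySupplementary =
        new Regex(@"[\uD800-\uDBFF][\uDC00-\uDFFF]");

    public static string RemoveLeadByteF0(string text)
    {
        return LeadByteF0.Replace(text, "");
    }

    public static string RemoveAllFourByte(string text)
    {
        return AnySupplementary.Replace(text, "");
    }

    public static void Main()
    {
        string input = "] \U0001D11E (";
        Console.WriteLine(RemoveLeadByteF0(input).Length); // 4

        // The original pattern is parsed as the set: \x10, '0', the range '0'..'?', 'F'.
        Console.WriteLine(Regex.Replace("F0 9D", "[\\x10000-\\x3FFFF]", "")); // " D"
    }
}
```

If the goal is to drop everything MySQL's "UTF8" cannot store, `RemoveAllFourByte` is probably the variant to use, since it is not limited to lead byte F0.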

The relevant characters from the source quoted in the question have been extracted into the code below. The code then tries to understand how the characters are held in strings.

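static void Main(string[] args)
{
    // Characters extracted from the question's input; \U0001D11E is the 4-byte character.
    string input = "] \U0001D11E (";
    Console.Write("Input length  {0} : '{1}'  : ", input.Length, input);
    // Print each UTF-16 code unit (char value) in hexadecimal.
    foreach (char cc in input)
    {
        Console.Write("  {0,2:X02}", (int)cc);
    }
    Console.WriteLine();
}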

The output from the program is as below. This supports the surrogate pair explanation given by @Jeppe in his answer.

Input length  6 : '] ?? ('  :   5D  20  D834  DD1E  20  28
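To go one step further, the two surrogate values D834 DD1E can be combined back into the single code point U+1D11E with `char.ConvertToUtf32`. A small sketch, with the hypothetical helper `CodePoints`:

```csharp
using System;
using System.Collections.Generic;

public static class CodePointDump
{
    // Decode a UTF-16 string into Unicode code points, joining surrogate pairs.
    public static List<int> CodePoints(string text)
    {
        var result = new List<int>();
        for (int i = 0; i < text.Length; i++)
        {
            if (char.IsHighSurrogate(text[i]) && i + 1 < text.Length
                && char.IsLowSurrogate(text[i + 1]))
            {
                result.Add(char.ConvertToUtf32(text[i], text[i + 1]));
                i++; // skip the low surrogate that was just consumed
            }
            else
            {
                result.Add(text[i]); // BMP character (or an unpaired surrogate)
            }
        }
        return result;
    }

    public static void Main()
    {
        foreach (int cp in CodePoints("] \U0001D11E ("))
        {
            Console.Write("U+{0:X4} ", cp);
        }
        Console.WriteLine();
        // U+005D U+0020 U+1D11E U+0020 U+0028
    }
}
```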

@Qix Why do you want to modify a direct quotation from the language standard? The quoted section does not have any bold text and its uses string quotes. Please explain. — AdrianHHH, Aug 02 '16 at 08:24
Because it better emphasizes your point. It's not changing the meaning of the spec. I had to search for the reason why you were including the notation of a single character and had to search for the _real_ answer, which is the distinction between a unicode `char` and a single `byte`. — Qix - MONICA WAS MISTREATED, Aug 02 '16 at 08:30
@Qix The first sentence of my answer refers to the difference between `char` and `byte` in C#. — AdrianHHH, Aug 02 '16 at 08:35
Thanks a lot. This most likely points into the right direction, but your solution still doesn't work. It removes a lot of characters from the input, but not the 4 byte UTF8 characters. Even `Regex.Replace (sText, "\\x1D11E", "")` does not remove the precise single character from the input. — André, Aug 02 '16 at 08:43
