You are trying to search for byte values, but C# strings are made of char values. The C# language specification, in section "2.4.4.4 Character literals", states:
A character literal represents a single character, and usually consists of a character in quotes, as in 'a'.
...
A hexadecimal escape sequence represents a single Unicode character, with the value formed by the hexadecimal number following \x.
Hence the search for "\xF0..." is searching for the character U+F0, which would be represented in UTF-8 by the bytes C3 B0.
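As a quick check of the point above, here is a minimal sketch that prints the UTF-8 bytes of the string "\xF0". The class name HexEscapeDemo and the helper Utf8Hex are made up for this illustration; only the framework calls are real.

```csharp
using System;
using System.Text;

static class HexEscapeDemo
{
    // Hypothetical helper: UTF-8 bytes of a string as space-separated hex.
    public static string Utf8Hex(string s)
    {
        return BitConverter.ToString(Encoding.UTF8.GetBytes(s)).Replace("-", " ");
    }

    static void Main()
    {
        string s = "\xF0";   // a single char with value 0x00F0, not a byte
        Console.WriteLine("U+{0:X4} -> {1}", (int)s[0], Utf8Hex(s));
        // prints: U+00F0 -> C3 B0
    }
}
```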
If you want to find and replace all Unicode characters whose UTF-8 encoding starts with the byte 0xF0, then I believe you need to search for the character values whose first byte is 0xF0.
The character U+10000 is represented as F0 90 80 80 (the preceding code is U+FFFF, which is EF BF BF). The first code of the form F1 .. .. .. is U+40000, which is F1 80 80 80, and the value before it is U+3FFFF, which is F0 BF BF BF.
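The boundary values above can be checked with a short sketch like the following. The class name Utf8Boundaries and the helper Utf8Hex are invented for this example; it only assumes that Encoding.UTF8 is used to produce the bytes.

```csharp
using System;
using System.Text;

static class Utf8Boundaries
{
    // Hypothetical helper: UTF-8 bytes of one code point as space-separated hex.
    public static string Utf8Hex(int codePoint)
    {
        string s = char.ConvertFromUtf32(codePoint);
        return BitConverter.ToString(Encoding.UTF8.GetBytes(s)).Replace("-", " ");
    }

    static void Main()
    {
        // The code points on either side of the F0 lead-byte range.
        foreach (int cp in new[] { 0xFFFF, 0x10000, 0x3FFFF, 0x40000 })
        {
            Console.WriteLine("U+{0:X} -> {1}", cp, Utf8Hex(cp));
        }
    }
}
```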
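Hence you need to remove characters in the range U+10000 to U+3FFFF. Note that .NET regular expressions read \x as exactly two hexadecimal digits, so a class such as [\x10000-\x3FFFF] does not describe that range. Because C# strings are UTF-16, these characters are stored as surrogate pairs: a high surrogate in the range D800 to D8BF followed by a low surrogate in the range DC00 to DFFF. This should be possible with a regular expression such as
sText = Regex.Replace(sText, "[\\uD800-\\uD8BF][\\uDC00-\\uDFFF]", "");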
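If you would rather not depend on regular expression escape syntax, the same range can be filtered by walking the string one code point at a time. This is only a sketch; the names AstralFilter and RemoveF0Range are hypothetical and not part of any library.

```csharp
using System;
using System.Text;

static class AstralFilter
{
    // Hypothetical helper: drops code points U+10000..U+3FFFF, i.e. those
    // whose UTF-8 encoding starts with the byte 0xF0. Everything else is kept.
    public static string RemoveF0Range(string text)
    {
        var sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            if (char.IsSurrogatePair(text, i))
            {
                int cp = char.ConvertToUtf32(text, i);
                i++; // step over the low surrogate as well
                if (cp >= 0x10000 && cp <= 0x3FFFF)
                {
                    continue; // in the range to remove
                }
                sb.Append(char.ConvertFromUtf32(cp));
            }
            else
            {
                sb.Append(text[i]); // ordinary char (or a lone surrogate)
            }
        }
        return sb.ToString();
    }

    static void Main()
    {
        string input = "] \U0001D11E (";   // contains U+1D11E
        string output = RemoveF0Range(input);
        Console.WriteLine("{0} -> {1}", input.Length, output.Length);
        // prints: 6 -> 4
    }
}
```

Characters from U+40000 upwards are also stored as surrogate pairs, which is why the sketch decodes each pair before deciding whether to drop it.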
The relevant characters from the source quoted in the question have been extracted into the code below. The code then tries to understand how the characters are held in strings.
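using System;

class Program
{
    static void Main(string[] args)
    {
        // U+1D11E (the D834 DD1E pair in the output), written as an escape
        // so that it survives copy and paste.
        string input = "] \U0001D11E (";
        Console.Write("Input length {0} : '{1}' : ", input.Length, input);
        foreach (char cc in input)
        {
            // Print each UTF-16 code unit in hexadecimal.
            Console.Write(" {0,2:X02}", (int)cc);
        }
        Console.WriteLine();
    }
}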
The output from the program is as below. This supports the surrogate pair explanation given by @Jeppe in his answer.
Input length 6 : '] ?? (' : 5D 20 D834 DD1E 20 28