I retrieve strings from a website using the HttpClient class. The web server sends them in UTF-8 encoding. The strings have the form abc | a
and I'd like to remove the pipe, the space, and the character after the space when they occur at the end of the string.
sText = Regex.Replace (sText, @"\| .$", "");
works as expected. In some cases, however, the pipe and the space are followed by a different kind of character, for example a smiley. The string then has the form abc | followed by the smiley. The regular expression above does not match in that case, and I have to use
sText = Regex.Replace (sText, @"\| ..$", "");
instead (two dots).
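The behaviour can be reproduced without HttpClient, using plain string literals. This is only a minimal sketch: the class name Repro and the sample emoji U+1F600 are assumptions chosen for illustration, and the real strings may contain other characters.

```csharp
using System;
using System.Text.RegularExpressions;

class Repro
{
    static void Main()
    {
        string latin = "abc | a";
        // Assumed sample smiley (U+1F600); stands in for whatever the server sends.
        string smiley = "abc | \U0001F600";

        // One dot is enough for the Latin character.
        Console.WriteLine(Regex.Replace(latin, @"\| .$", "") == "abc ");

        // One dot does not match the smiley; the string comes back unchanged.
        Console.WriteLine(Regex.Replace(smiley, @"\| .$", "") == smiley);

        // Two dots are needed to remove the smiley.
        Console.WriteLine(Regex.Replace(smiley, @"\| ..$", "") == "abc ");

        // Number of char values the smiley occupies in the .NET string.
        Console.WriteLine("\U0001F600".Length);
    }
}
```

The last line prints the number of char values the smiley takes up in the string, which is the count that the dot in the pattern works against.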
I'm quite sure this has something to do with the encoding: the smiley takes more bytes in UTF-8 than a Latin character, and I suspect C# doesn't know which encoding the string uses. The smiley is just one character, even if it takes more bytes, so after telling C# the correct encoding (or converting the string), I would expect the first regular expression to work in both cases.
How can this be done?