14

I am parsing emails using a regex in a C# VSTO project. Once in a while, the regex does not seem to work (although if I paste the text and regex into RegexBuddy, the regex correctly matches the text). If I look at the email in Gmail, I see

=E2=80=8B

at the beginning and end of some lines (which I understand is the UTF-8 encoding of the zero-width space); this appears to be what is messing up the regex. This seems to be the only such sequence showing up.

What is the easiest way to get rid of this exact sequence? I cannot do the obvious

MailItem.Body.Replace("=E2=80=8B", "")

because those characters don't show up in the C# string.

I also tried

byte[] bytes = Encoding.Default.GetBytes(MailItem.TextBody);
string myString = Encoding.UTF8.GetString(bytes);

But the zero-width spaces just show up as ?. I suppose I could go through the byte array and remove the bytes comprising the zero-width space, but I don't know what the bytes would look like (it does not seem as simple as converting E2 80 8B to decimal and searching for that).

Jimmy

3 Answers

27

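As strings in C# are stored as UTF-16 (not UTF-8), the zero-width space is the single character U+200B rather than the three bytes E2 80 8B, so the following might do the trick. Note that Replace returns a new string, so the result has to be kept:

string cleaned = MailItem.Body.Replace("\u200B", "");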
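To make the connection between =E2=80=8B and "\u200B" concrete, here is a minimal sketch. The class and method names (ZeroWidthSpaceSketch, Utf8Hex, Strip) are made up for illustration and are not part of Outlook or VSTO. It assumes the mail body has already been decoded into a normal .NET string, which appears to be the case for MailItem.Body.

```csharp
using System;
using System.Text;

// Hypothetical helper, for illustration only.
public static class ZeroWidthSpaceSketch
{
    // U+200B ZERO WIDTH SPACE is one char in a .NET (UTF-16) string.
    public const char ZeroWidthSpace = '\u200B';

    // Encodes U+200B as UTF-8 and formats the bytes as hex,
    // which is where the E2 80 8B seen in the raw message comes from.
    public static string Utf8Hex()
    {
        byte[] bytes = Encoding.UTF8.GetBytes(ZeroWidthSpace.ToString());
        return BitConverter.ToString(bytes);
    }

    // Removes every zero-width space; string.Replace(string, string)
    // does an ordinal search and returns a new string.
    public static string Strip(string text)
    {
        return text.Replace("\u200B", "");
    }
}
```

Utf8Hex() returns "E2-80-8B", and Strip("\u200Bfoo bar\u200B") returns "foo bar". The =E2=80=8B form is most likely just the quoted-printable rendering of those three bytes in the raw message, which is why searching the decoded string for that literal text finds nothing.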
Robert S.
  • To get rid of all similar unicode characters (see https://www.utf8-chartable.de/unicode-utf8-table.pl?start=8192&number=128) and replace them by the space character: Regex.Replace(textWithUnicodeCharacters, @"\s", " ") – xhafan May 04 '23 at 14:27
3

As all the Regex.Replace() methods operate on strings, that's not going to be useful here.

The string indexer returns a char, so for want of a better solution (and if you can't predict where these characters are going to be), as long-winded as it seems, you may be best off with:

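        string a = MailItem.Body;
        StringBuilder newText = new StringBuilder();

        for (int i = 0; i < a.Length; i++)
        {
            // Copy every character except the zero-width space (U+200B).
            if (a[i] != '\u200b')
            {
                newText.Append(a[i]);
            }
        }

        string cleaned = newText.ToString();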
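If other invisible characters turn up besides U+200B, the same loop can filter on Unicode category instead of a single code point. This is only a sketch: InvisibleCharFilter and RemoveFormatChars are hypothetical names, and it assumes that dropping every character in the Format (Cf) category is acceptable for your mail text. That category also contains the zero-width joiner and non-joiner, which some scripts and emoji sequences rely on.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical helper, for illustration only.
public static class InvisibleCharFilter
{
    // Copies every character except those in the Unicode "Format" (Cf)
    // category, which includes U+200B (zero-width space) and U+FEFF (BOM).
    public static string RemoveFormatChars(string text)
    {
        var sb = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            if (char.GetUnicodeCategory(c) != UnicodeCategory.Format)
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}
```

For example, RemoveFormatChars("a\u200Bb\uFEFFc") gives "abc", while ordinary spaces and line breaks are left alone.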
dyson
0

Use System.Web.HttpUtility.HtmlDecode(string); Quite simple.

gustavomcastro
  • In my experience, this does not remove all invisible whitespace characters, as I was still left with a string of length 1 that appeared empty, and did not trip `string.IsNullOrWhiteSpace` – Jules Mar 05 '19 at 22:48