Simplest way to get rid of zero-width-space in c# string

Question

I am parsing emails using a regex in a C# VSTO project. Once in a while, the regex does not seem to work (although if I paste the text and regex into RegexBuddy, the regex correctly matches the text). If I look at the email in Gmail, I see

=E2=80=8B

at the beginning and end of some lines (which I understand is the UTF-8 encoding of the zero-width space); this appears to be what is messing up the regex. This seems to be the only such sequence showing up.

What is the easiest way to get rid of this exact sequence? I cannot do the obvious

MailItem.Body.Replace("=E2=80=8B", "")

because those characters don't show up in the C# string.

I also tried

byte[] bytes = Encoding.Default.GetBytes(MailItem.TextBody);
string myString = Encoding.UTF8.GetString(bytes);

But the zero-width spaces just show up as ?. I suppose I could go through the byte array and remove the bytes that make up the zero-width space, but I don't know what the bytes would look like (it does not seem as simple as converting E2 80 8B to decimal and searching for that).

score 27 · Accepted Answer · answered Jul 24 '14 at 22:36

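As strings in C# are stored as UTF-16 (not UTF-8), the zero-width space is the single character U+200B, so the following might do the trick (note that Replace returns a new string, so the result has to be assigned):

string body = MailItem.Body.Replace("\u200B", "");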
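A minimal sketch of why this works, using only the base class library. The three bytes E2 80 8B from the question are the UTF-8 encoding of the single character U+200B, which is why searching the string for the literal text "=E2=80=8B" finds nothing. The class and method names below (ZeroWidthSpaceDemo, StripZeroWidthSpaces, Utf8Hex) are made up for illustration and are not part of any Outlook or VSTO API.

```csharp
using System;
using System.Text;

public static class ZeroWidthSpaceDemo
{
    // Hypothetical helper: removes every U+200B from the input.
    // Strings are immutable, so Replace returns a new string.
    public static string StripZeroWidthSpaces(string text)
    {
        return text.Replace("\u200B", "");
    }

    // Hypothetical helper: shows the UTF-8 bytes of a string as hex,
    // e.g. "E2-80-8B" for a lone zero-width space.
    public static string Utf8Hex(string text)
    {
        return BitConverter.ToString(Encoding.UTF8.GetBytes(text));
    }
}
```

With this, ZeroWidthSpaceDemo.Utf8Hex("\u200B") gives "E2-80-8B", and StripZeroWidthSpaces("\u200BInvoice 42\u200B") gives "Invoice 42". The round trip through Encoding.Default in the question probably fails because, on .NET Framework, that encoding is the system ANSI code page, which cannot represent U+200B and substitutes a question mark.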

Robert S.

To get rid of all similar Unicode characters (see https://www.utf8-chartable.de/unicode-utf8-table.pl?start=8192&number=128) and replace them with the space character: Regex.Replace(textWithUnicodeCharacters, @"\s", " ") – xhafan May 04 '23 at 14:27
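One caveat worth checking before relying on \s alone: in .NET, \s matches ordinary whitespace plus the Unicode separator categories (\p{Z}), which covers U+2000 through U+200A, but U+200B itself is a format character (category Cf) and is not matched. A hedged sketch that adds \p{Cf} to the character class is below; the class name InvisibleCharCleaner and its method are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class InvisibleCharCleaner
{
    // \s      : whitespace, including Unicode space separators such as U+2003.
    // \p{Cf}  : format characters such as U+200B (zero-width space) and U+FEFF.
    private static readonly Regex Invisible = new Regex(@"[\s\p{Cf}]");

    // Replaces each matched character with a single ASCII space.
    public static string ReplaceWithSpace(string text)
    {
        return Invisible.Replace(text, " ");
    }
}
```

For example, InvisibleCharCleaner.ReplaceWithSpace("a\u200Bb\u2003c") gives "a b c". Note that this also turns newlines and tabs into spaces, and \p{Cf} includes characters such as the zero-width joiner (U+200D) and the soft hyphen (U+00AD), so it may be too broad for text where those matter.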

score 3 · Answer 2 · answered Jul 24 '14 at 19:54

As all the Regex.Replace() methods operate on strings, that's not going to be useful here.

The string indexer returns a char, so for want of a better solution (and if you can't predict where these characters are going to be), as long-winded as it seems, you may be best off with:

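        string a = MailItem.Body;
        StringBuilder newText = new StringBuilder(a.Length);

        // Copy every character except the zero-width space (U+200B).
        for (int i = 0; i < a.Length; i++)
        {
            if (a[i] != '\u200b')
            {
                newText.Append(a[i]);
            }
        }

        string cleaned = newText.ToString();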

This answer works as well, but Robert S.' is more succinct so I accepted that one. — Jimmy, Jul 25 '14 at 00:57

score 0 · Answer 3 · answered Jan 19 '17 at 18:05

Use System.Web.HttpUtility.HtmlDecode(string); Quite simple.

gustavomcastro

In my experience, this does not remove all invisible whitespace characters, as I was still left with a string of length 1 that appeared empty, and did not trip `string.IsNullOrWhiteSpace` – Jules Mar 05 '19 at 22:48
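A small sketch that reproduces the behaviour described in this comment. It assumes the zero-width space arrived as the HTML entity &#8203;, and it swaps in System.Net.WebUtility.HtmlDecode instead of System.Web.HttpUtility.HtmlDecode so that no System.Web reference is needed; the two are expected to behave the same for numeric entities. The class name HtmlDecodeCheck and its members are hypothetical.

```csharp
using System;
using System.Net;

public static class HtmlDecodeCheck
{
    // Decodes HTML entities; "&#8203;" becomes the character U+200B.
    // Decoding converts the entity, it does not remove the character.
    public static string Decode(string html)
    {
        return WebUtility.HtmlDecode(html);
    }

    // True when the string has content but consists only of zero-width spaces,
    // i.e. it looks empty on screen while IsNullOrWhiteSpace reports false.
    public static bool IsOnlyZeroWidthSpaces(string s)
    {
        return s.Length > 0
            && !string.IsNullOrWhiteSpace(s)
            && s.Replace("\u200B", "").Length == 0;
    }
}
```

Here HtmlDecodeCheck.Decode("&#8203;") returns a string of length 1 containing U+200B, and IsOnlyZeroWidthSpaces returns true for it, so an explicit Replace("\u200B", "") is still needed after decoding.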
