2

I have looked at quite a number of related SO posts on this. I have a malformed string that contains Unicode characters that I want to strip away.

string testString = "\0\u0001\0\0\0����\u0001\0\0\0\0\0\0\0\u0011\u0001\0\0\0\u0004\0\0\0\u0006\u0002\0\0\0\u0005The\u0006\u0003\0\0\0\u0017boy\u0006\u0004\0\0\0\tKicked\u0006\u0005\0\0\0\u0013the Ball\v";

I would like the following output:

The boy kicked the Ball

How can I achieve this?

I have looked at the following, without much success:

  1. How can you strip non-ASCII characters from a string? (in C#)
  2. Converting unicode characters (C#) Testing
  3. How to Remove '\0' from a string in C#?
  4. Removing unwanted character from column (SQL Server related so not relevant in my question)
Harold_Finch
  • What's the actual source of `testString`? I assume it's not hard-coded like that in your real code. – Enigmativity Jun 26 '20 at 04:29
  • @Enigmativity I got this as a result of decrypting an encrypted byte[] array via RSA asymmetric encryption, i.e. `string testString = Encoding.UTF8.GetString(encryptedByteArray, 0, encryptedByteArray.Length);` gave me what I posted in the question. I only changed the actual strings. – Harold_Finch Jun 26 '20 at 04:38
  • Then I suspect that you need to get your decryption character encoding right. I don't think this is an issue of stripping the Unicode characters. You're doing two wrongs, which isn't a right. Can you please post your decrypted byte array? Then we can probably get your string cleanly without the need to strip anything. – Enigmativity Jun 26 '20 at 05:41
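
One way to post the decrypted bytes as the last comment asks is to dump them as hex instead of decoding them as text. This is only a minimal sketch: the `HexDump` class and the `decryptedBytes` name in the usage line are hypothetical, while `BitConverter.ToString` is the standard .NET call.

```csharp
using System;

public static class HexDump
{
    // Formats a byte array as hyphen-separated, upper-case hex pairs.
    public static string ToHex(byte[] bytes)
    {
        return BitConverter.ToString(bytes);
    }
}
```

Usage would look like `Console.WriteLine(HexDump.ToHex(decryptedBytes));`, which prints the raw bytes without passing them through any character encoding.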

5 Answers

2
using System.Text;

// Keeps only characters in the range 32..127, additionally dropping '%' and '?'.
public string ReturnCleanASCII(string s)
{
    StringBuilder sb = new StringBuilder(s.Length);
    foreach (char c in s)
    {
        if (c > 127) // non-ASCII; you probably don't want 127 (DEL) either
            continue;
        if (c < 32)  // I bet you don't want control characters
            continue;
        if (c == '%' || c == '?')
            continue;
        sb.Append(c);
    }
    return sb.ToString();
}
payam purchi
1

testString = Regex.Replace(testString, @"[\u0000-\u0008\u000A-\u001F\u0100-\uFFFF]", "");

or

testString = Regex.Replace(testString, @"[^\t\r\n -~]", "");
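
Deleting the unwanted characters outright also deletes the only thing separating the words, so "The" and "boy" end up joined. A possible variant, sketched here under the assumption that every run of non-printable characters marks a word boundary, replaces each run with a single space and trims the result. The `AsciiCleaner` class and `CollapseToPrintable` method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AsciiCleaner
{
    // Replaces each run of characters outside printable ASCII (space through '~')
    // with a single space, then trims leading and trailing whitespace.
    public static string CollapseToPrintable(string input)
    {
        return Regex.Replace(input, @"[^ -~]+", " ").Trim();
    }
}
```

Called on the `testString` from the question, this should give `The boy Kicked the Ball`. The capital K comes from the input itself, so matching the lower-case "kicked" in the desired output would need a separate step.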

JoelFan
1

I use this regular expression to filter out bad characters in a filename.

directory = Regex.Replace(directory, @"[^a-zA-Z0-9\\:_\- ]", "");
sep7696
0

Try this:

string s = "søme string";
s = Regex.Replace(s, @"[^\u0000-\u007F]+", string.Empty);

Hope it helps.

Raul Marquez
0

Instead of trying to remove the Unicode characters, why not just extract the runs of printable ASCII characters:

var str = string.Join(" ", Regex.Matches(testString, "[ -~]+").Cast<Match>().Select(m => m.Value));
JohanP