2

I need a better way to strip control characters (shown here by their hex codes) from an exception message. For now I replace each character manually, which works but is a total disaster:

            var clearedStr = str.Replace(Convert.ToString((char)0x01), "")
            .Replace(Convert.ToString((char)0x02), "")
            .Replace(Convert.ToString((char)0x03), "")
            .Replace(Convert.ToString((char)0x04), "")
            .Replace(Convert.ToString((char)0x05), "")
            .Replace(Convert.ToString((char)0x06), "")
            .Replace(Convert.ToString((char)0x07), "")
            .Replace(Convert.ToString((char)0x08), "")
            .Replace(Convert.ToString((char)0x0B), "")
            .Replace(Convert.ToString((char)0x0C), "")
            .Replace(Convert.ToString((char)0x0E), "")
            .Replace(Convert.ToString((char)0x0F), "")
            .Replace(Convert.ToString((char)0x10), "")
            .Replace(Convert.ToString((char)0x11), "")
            .Replace(Convert.ToString((char)0x12), "")
            .Replace(Convert.ToString((char)0x13), "")
            .Replace(Convert.ToString((char)0x14), "")
            .Replace(Convert.ToString((char)0x15), "")
            .Replace(Convert.ToString((char)0x16), "")
            .Replace(Convert.ToString((char)0x17), "")
            .Replace(Convert.ToString((char)0x18), "")
            .Replace(Convert.ToString((char)0x19), "")
            .Replace(Convert.ToString((char)0x1a), "")
            .Replace(Convert.ToString((char)0x1b), "")
            .Replace(Convert.ToString((char)0x1c), "")
            .Replace(Convert.ToString((char)0x1d), "")
            .Replace(Convert.ToString((char)0x1e), "")
            .Replace(Convert.ToString((char)0x84), "")
            .Replace(Convert.ToString((char)0x86), "")
            .Replace(Convert.ToString((char)0x87), "")
            .Replace(Convert.ToString((char)0x88), "")
            .Replace(Convert.ToString((char)0x89), "");

For example, the message contains characters like these:

some of these characters

I wrote a regex, but it only matches the hex notation as text (for example the literal string 0x1e), not the character that notation stands for.

What I need to find are the characters themselves, not their hex notation:

"","‘","ƒ","","","’","","š","ˆ","‰","Š","‹","Œ","","„", "†", "‡"

Same characters with their symbols :

"RS: , PU1 : ‘, NBH : ƒ, US : , ESC : , PU2: ’, GS : , SCI: š, HTS: ˆ, HTJ : ‰, VTS : Š, PLD : ‹, PLU: Œ, SUB :, IND: „, SSA: †, ESA : ‡"

This is the regex I wrote:

http://regexstorm.net/tester?p=%5b0-9%5dx%5b0-9A-F%5d&i=0x1e+0x91+0x1c+0x83

Also, I need to cover all characters of this kind, not just a handful of them.

example of characters

cansu
  • ASCII Encoding will remove all non printable characters. – jdweng Aug 31 '20 at 13:47
  • @jdweng that's not true. ASCII contains a bunch of non-printable characters; Line Feed (0x0A) comes to mind, as well as [a bunch more](https://web.itu.edu.tr/sgunduz/courses/mikroisl/ascii.html) – MindSwipe Aug 31 '20 at 13:50
  • This might help: https://stackoverflow.com/questions/3253247/how-do-i-detect-non-printable-characters-in-net – Klaus Gütter Aug 31 '20 at 13:50
  • @jdweng I tried it with an online encoder. I need to display this data, so if I encode the text I have to decode it again to show the message to the user, and decoding probably turns it back into the original characters. I still need to test it in detail, though. – cansu Aug 31 '20 at 13:54
  • @MindSwipe : A linefeed is a printable character since it causes the printer to move to next line. A Bell, SOT, and EOT would be non printable – jdweng Aug 31 '20 at 13:58
  • If you are going to do repetitive `Replace` calls like you show, consider using `StringBuilder.Replace` rather than `string.Replace`. It generates a lot less *garbage* to be collected. – Flydog57 Aug 31 '20 at 13:59
  • @jdweng it affects the output yes, but the character itself isn't printed, and that is by definition a non printable character – MindSwipe Aug 31 '20 at 14:01
  • @KlausGütter thank you. It seems Char.IsControl returns true for one of these control characters and Char.IsControl('x') returns false. That is good, but there may be a performance issue: my text is probably long, and checking every character of a large string could be a problem. Still, it could be part of the final solution. Thanks. – cansu Aug 31 '20 at 14:01
  • A non-printable character is one that the printer does not use. The printer definitely uses the return. – jdweng Aug 31 '20 at 14:08
  • @jdweng "Non-printing characters [...] are characters [...] which aren't displayed at printing. [...] The most common non-printable characters are [...] **Tab character** etc." (emphasis mine); the Wikipedia article continues to name a few more. A non-printing character is one the printer doesn't print: you cannot see the newline character in the result, but the printer still uses it and inserts a new line instead – MindSwipe Aug 31 '20 at 14:22
  • How about this.. check this regex. [0-9]x[0-9A-Fa-f].+? – Balaji J Aug 31 '20 at 13:50

3 Answers

5

As MindSwipe suggests, you may use \p{C} to match any control character (\p{C} is the Unicode "Other" category, which includes the control characters).

However, you do not need a lot of extra code to keep the characters you want; use character class subtraction:

var output = Regex.Replace(YourTextVariable, @"[\p{C}-[\t\r\n]]+", "");

This matches one or more control characters other than tab, carriage return, and line feed.
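As a minimal sketch of the same idea wrapped in a reusable helper: the class name `ControlCharStripper` and the method `Strip` are hypothetical, and the pattern is narrowed from \p{C} to \p{Cc} (the Unicode "Control" category only). That narrowing is my assumption, not part of the answer above; the reasoning is that \p{C} also covers format characters and surrogates, and since .NET strings are UTF-16, it may remove the halves of surrogate pairs such as emoji.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ControlCharStripper
{
    // \p{Cc} covers U+0000-U+001F and U+007F-U+009F.
    // The -[\t\r\n] part subtracts the characters we want to keep.
    private static readonly Regex ControlChars =
        new Regex(@"[\p{Cc}-[\t\r\n]]+", RegexOptions.Compiled);

    // Returns a copy of input with control characters removed,
    // leaving tabs, carriage returns, and line feeds in place.
    public static string Strip(string input)
    {
        return ControlChars.Replace(input, string.Empty);
    }
}
```

Calling `ControlCharStripper.Strip(exception.Message)` would be the intended use; keeping the `Regex` in a static field avoids re-parsing the pattern on every call.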

Ryszard Czech
  • This is vastly superior to my answer. I also learned something new about Regex, thanks. @cansu you should really accept this answer instead of mine – MindSwipe Sep 03 '20 at 06:09
  • You are right @MindSwipe, this answer is actually more accurate. Thank you for your contribution as well. – cansu Sep 03 '20 at 06:42
2

Before reading further, please take a look at Ryszard Czech's answer, which shows how to do this without the superfluous code for adding newlines back.


This can be achieved by replacing every control character in your string, and luckily Regex has the answer:

var s = "a \nb" + Convert.ToString((char)0x1b) + Convert.ToString((char)0x1e);
// Regex.Replace returns a new string; the original is left unchanged
var cleaned = Regex.Replace(s, @"\p{C}+", String.Empty);

@"\p{C}+" matches all control characters. Be warned: this also matches newlines (\n), so the output won't contain any line breaks, as you can see in this example. If you want to keep your newlines, you'll have to first split the string into an array of lines, run Regex.Replace on each line, and then put them back together. Something like so:

var lines = s.Split(new[] { Environment.NewLine }, StringSplitOptions.None);
var sb = new StringBuilder();

foreach (var line in lines)
{
    sb.AppendLine(Regex.Replace(line, @"\p{C}+", String.Empty));
}

s = sb.ToString();

This leaves a trailing newline, which can easily be removed like so:

if (sb[sb.Length - 1] == '\n')
    sb.Remove(sb.Length - 1, 1);

Do this before calling sb.ToString(). Here is a dotnetfiddle demonstrating this
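The split-and-rejoin approach above can also be sketched with `string.Join`, which places separators only between lines and so never produces a trailing newline to trim. This is an alternative sketch, not the code from the fiddle; the names `LineWiseCleaner` and `Clean` are hypothetical. It assumes that splitting on '\n' is acceptable, which means any '\r' from Windows line endings is stripped along with the other control characters and all line endings come out as "\n".

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class LineWiseCleaner
{
    // Removes control characters from each line but keeps the line structure.
    // Note: line endings are normalized to "\n" as a side effect.
    public static string Clean(string input)
    {
        var cleanedLines = input
            .Split('\n')
            .Select(line => Regex.Replace(line, @"\p{C}+", string.Empty));

        // Join inserts "\n" between lines only, so no trailing newline is added.
        return string.Join("\n", cleanedLines);
    }
}
```

If the original line endings must be preserved exactly, the character class subtraction shown in the other answer is probably the simpler route.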

MindSwipe
0

Sometimes a good old foreach is the right way to go. How about:

 private static readonly char[] CharsToReplace =
 {
     '\x02',
     '\x03',
     '\x04',
     '\x05',
     '\x06',
     '\x07',
     '\x08',
     '\x0B',
     '\x0C',
     '\x0E',
     '\x0F',
     '\x10',
     '\x11',
     '\x12',
     '\x13',
     '\x14',
     '\x15',
     '\x16',
     '\x17',
     '\x18',
     '\x19',
     '\x1a',
     '\x1b',
     '\x1c',
     '\x1d',
     '\x1e',
     '\x84',
     '\x86',
     '\x87',
     '\x88',
     '\x89',
 };

public static string ReplaceNonPrintables(string stringToProcess)
{
    StringBuilder buf = new StringBuilder(stringToProcess.Length);
    foreach (var c in stringToProcess)
    {
        if (!CharsToReplace.Contains(c))
        {
            buf.Append(c);
        }
    }

    return buf.ToString();
}
Flydog57
  • Thanks, this could be the solution if none of the other ideas solve the problem. – cansu Aug 31 '20 at 14:21
  • The drawback is that it is O(N×M) (where N is the string length and M is the number of characters to remove). It might be faster to use a `HashSet<char>`. – Flydog57 Aug 31 '20 at 14:26
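The `HashSet` idea from the last comment could look roughly like the sketch below. The names `NonPrintableFilter` and `Remove` are hypothetical, and the set holds only an illustrative subset of the characters from the answer; extend it with the full list as needed. Each lookup in a `HashSet<char>` is expected constant time, so the loop is about O(N) in the string length.

```csharp
using System.Collections.Generic;
using System.Text;

public static class NonPrintableFilter
{
    // Illustrative subset only; add the remaining characters from the list above.
    private static readonly HashSet<char> CharsToRemove = new HashSet<char>
    {
        '\x02', '\x03', '\x1b', '\x1e', '\x84', '\x86'
    };

    // Copies every character that is not in the removal set.
    public static string Remove(string input)
    {
        var buf = new StringBuilder(input.Length);
        foreach (var c in input)
        {
            if (!CharsToRemove.Contains(c))
            {
                buf.Append(c);
            }
        }

        return buf.ToString();
    }
}
```

For a list of around thirty characters the difference from the array version may be small in practice, so it is worth measuring before choosing one over the other.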