33

I would appreciate your help with this, since I do not know which range of characters to use, or whether C# has a character class like the [[:cntrl:]] that I found in Ruby.

By non-printable, I mean all characters that are not shown in the output when one prints the input string; I want to delete them. Please note that I am looking for a C# regex; I do not have a problem with my code.

  • What characters are supposed to be non-printable? You need to build the regex character class for that. Perhaps, you just want `\p{C}` (= *invisible control characters and unused code points*), or `\p{Cc}` (just control characters, see http://www.regular-expressions.info/posixbrackets.html). – Wiktor Stribiżew Nov 12 '16 at 17:14

4 Answers

100

You may remove all control and other non-printable characters with

s = Regex.Replace(s, @"\p{C}+", string.Empty);

The \p{C} Unicode category class matches all control characters, even those outside the ASCII table because in .NET, Unicode category classes are Unicode-aware by default.

Breaking it down into subcategories

  • To only match basic control characters you may use \p{Cc}+, which covers the 65 chars in the Other, Control Unicode category. It is equal to a [\u0000-\u001F\u007F-\u009F]+ regex.
  • To only match 161 other format chars including the well-known soft hyphen (\u00AD), zero-width space (\u200B), zero-width non-joiner (\u200C), zero-width joiner (\u200D), left-to-right mark (\u200E) and right-to-left mark (\u200F) use \p{Cf}+. The equivalent including astral plane code points is a (?:[\xAD\u0600-\u0605\u061C\u06DD\u070F\u08E2\u180E\u200B-\u200F\u202A-\u202E\u2060-\u2064\u2066-\u206F\uFEFF\uFFF9-\uFFFB]|\uD804[\uDCBD\uDCCD]|\uD80D[\uDC30-\uDC38]|\uD82F[\uDCA0-\uDCA3]|\uD834[\uDD73-\uDD7A]|\uDB40[\uDC01\uDC20-\uDC7F])+ regex.
  • To match 137,468 Other, Private Use code points you may use \p{Co}+, or its equivalent including astral plane code points, (?:[\uE000-\uF8FF]|[\uDB80-\uDBBE\uDBC0-\uDBFE][\uDC00-\uDFFF]|[\uDBBF\uDBFF][\uDC00-\uDFFD])+.
  • To match the 2,048 Other, Surrogate code points, which in UTF-16 strings also make up astral plane characters such as some emojis, you may use \p{Cs}+, or a [\uD800-\uDFFF]+ regex.
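The subcategory patterns above can be compared side by side. This is only a minimal sketch, not part of the original answer: the class name `OtherCategoryDemo`, its method names, and the sample string are assumptions made for illustration.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper class; all names here are illustrative only.
public static class OtherCategoryDemo
{
    // Removes Other, Control (Cc) characters only, e.g. BEL (\u0007).
    public static string StripControl(string s) =>
        Regex.Replace(s, @"\p{Cc}+", string.Empty);

    // Removes Other, Format (Cf) characters only, e.g. zero-width space (\u200B).
    public static string StripFormat(string s) =>
        Regex.Replace(s, @"\p{Cf}+", string.Empty);

    // Removes Other, Private Use (Co) characters only, e.g. \uE000.
    public static string StripPrivateUse(string s) =>
        Regex.Replace(s, @"\p{Co}+", string.Empty);

    // Removes every character in the Other (C) category.
    public static string StripAllOther(string s) =>
        Regex.Replace(s, @"\p{C}+", string.Empty);
}
```

For the sample string "a\u0007b\u200Bc\uE000d", each of the first three methods should remove exactly one character, while `StripAllOther` should return "abcd".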
Wiktor Stribiżew
  • I up-voted for the interesting approach and Unicode compatibility, though I would argue that tab, carriage-return, line-feed, etc could be interpreted as "printable" in some applications (eg, my own), so I prefer the approach shown [here](https://stackoverflow.com/a/14568679/5818981). – SteveCinq Jan 07 '19 at 03:49
  • @SteveCinq Then you may also use `@"[\p{C}-[\r\n\t]]+"` and add any other symbol into the nested brackets that you want to avoid replacing. – Wiktor Stribiżew Jan 07 '19 at 07:50
  • Here is the list of **Supported Unicode general categories** https://learn.microsoft.com/en-us/dotnet/standard/base-types/character-classes-in-regular-expressions#SupportedUnicodeGeneralCategories Tip: Take a look at the **Supported named blocks** (below) – Doomjunky Jan 01 '20 at 11:27
  • @Doomjunky That is a good reference, just note that those Unicode category classes do not match Unicode code points, they only match code units, that is why I added hex patterns that include all relevant code points. These code points can be obtained at [Unicode Utilities: UnicodeSet](https://unicode.org/cldr/utility/list-unicodeset.jsp). – Wiktor Stribiżew Jan 01 '20 at 16:03
  • TIL [Astral Plane](https://everything2.com/title/Astral+Plane) unicode. – HackSlash Apr 26 '23 at 22:27
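The character class subtraction suggested in the comments above, which keeps tab, carriage return and line feed, can be sketched as follows. The class name `KeepWhitespaceDemo` and the method name `Clean` are hypothetical and exist only for this illustration.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper class; names are illustrative only.
public static class KeepWhitespaceDemo
{
    // Character class subtraction: everything in \p{C} except CR, LF and tab.
    private static readonly Regex NonPrintable =
        new Regex(@"[\p{C}-[\r\n\t]]+");

    // Removes the matched characters and leaves line breaks and tabs in place.
    public static string Clean(string s) =>
        NonPrintable.Replace(s, string.Empty);
}
```

With the input "line1\r\n\tline2\u0000\u200B", `Clean` should keep the line break and the tab and drop the trailing NUL and zero-width space. Any other character to preserve can be added inside the nested brackets.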
6

You can try:

string s = "Täkörgåsmrgås";
s = Regex.Replace(s, @"[^\u0000-\u007F]+", string.Empty);


Updated answer after comments:

Documentation about non-printable characters: https://en.wikipedia.org/wiki/Control_character

Char.IsControl Method:

https://msdn.microsoft.com/en-us/library/system.char.iscontrol.aspx

Maybe you can try:

// Requires: using System.Linq;
string input; // this is your input string
// Keep every character for which char.IsControl returns false.
string output = new string(input.Where(c => !char.IsControl(c)).ToArray());
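The LINQ filter above can be wrapped in a small method to see what it keeps and what it drops. This is a sketch under assumptions: the class name `IsControlDemo` and the method name `RemoveControlChars` are hypothetical. Note that `char.IsControl` only covers the Control category, so format characters such as the zero-width space (\u200B) are left in place.

```csharp
using System.Linq;

// Hypothetical helper class; names are illustrative only.
public static class IsControlDemo
{
    // Drops every char for which char.IsControl returns true and
    // keeps everything else, including non-ASCII letters.
    public static string RemoveControlChars(string input) =>
        new string(input.Where(c => !char.IsControl(c)).ToArray());
}
```

For "T\u00E4k\u0007\u00F6r\tg\u00E5s" (accented letters mixed with BEL and a tab), the method should return "Täkörgås": the accented letters survive, while BEL and the tab are removed.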
Yanga
  • Thank you very much, I will try this. I only needed the regex, I have the code; my fear was losing printable characters along the way :) –  Nov 12 '16 at 16:07
  • I think you will, the output is: "Tkrgsmrgs". Can you give an example of which characters you wish to remove? – Yanga Nov 12 '16 at 16:09
  • Then this is not what I want; this is what I was afraid of, losing characters from the screen. My goal is to remove characters that are not shown on the screen but exist in the string and are useless. To give you an example, in Java I can keep all those characters with \p{Print}. –  Nov 12 '16 at 16:19
  • This worked for me. My problem was different in that I need to support UTF-8 but I also want to strip out the control characters. – Bryan Harrington Mar 16 '22 at 19:57
2

To remove all control and other non-printable characters:

s = Regex.Replace(s, @"\p{C}+", String.Empty);

To remove the control characters only (if you don't want to remove the emojis):

s = Regex.Replace(s, @"\p{Cc}+", String.Empty);
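The emoji remark above follows from .NET regexes working on UTF-16 code units: an emoji outside the Basic Multilingual Plane is stored as a surrogate pair, and each half belongs to \p{Cs}, which is part of \p{C}. The sketch below illustrates this; the class and method names are hypothetical, and the second pattern is an assumed alternative (not from the answer) that lists the subcategories explicitly and leaves out Cs.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper class; names are illustrative only.
public static class EmojiDemo
{
    // \p{C} includes surrogates (Cs), so surrogate pairs such as
    // \uD83D\uDE00 (U+1F600) are removed along with control characters.
    public static string StripAllOther(string s) =>
        Regex.Replace(s, @"\p{C}+", string.Empty);

    // Assumed alternative: Cc, Cf, Co and Cn only, so surrogate pairs
    // (and the emojis they encode) are kept.
    public static string StripOtherKeepSurrogates(string s) =>
        Regex.Replace(s, @"[\p{Cc}\p{Cf}\p{Co}\p{Cn}]+", string.Empty);
}
```

For "Hi \uD83D\uDE00\u0007!", `StripAllOther` should return "Hi !", while `StripOtherKeepSurrogates` should return "Hi \uD83D\uDE00!". The second pattern still removes the zero-width joiner (\u200D, category Cf), so joined emoji sequences may be split apart.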
Nerdroid
2

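You can try this:

    // Requires: using System.Text.RegularExpressions;
    // Extension methods must be declared inside a static class.
    public static string TrimNonAscii(this string value)
    {
        // Keep only printable ASCII (space through tilde); "+" avoids empty matches.
        // Note: this also removes printable non-ASCII characters such as "ä".
        return Regex.Replace(value, "[^ -~]+", string.Empty);
    }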