Regex for all PRINTABLE characters

Question

Is there a special regex statement like \w that denotes all printable characters? I'd like to validate that a string contains only characters that can be printed, i.e. it does not contain ASCII control characters like \a (bell), or null, etc. Anything on the keyboard is fine, and so are UTF chars.

If there isn't a special statement, how can I specify this in a regex?

If you were looking for pure ASCII characters, you could go with a Regex like `[ -~]+`, which matches every low ASCII from space to tilde. — saluce, Apr 24 '15 at 15:11
This was a good resource for this issue for me: https://www.regular-expressions.info/unicode.html#category — nuiun, Jan 28 '22 at 10:01

score 51 · Answer 1 · answered Jul 31 '15 at 07:29

Very late to the party, but this regexp works: /[ -~]/.

How? It matches all characters in the range from space (ASCII DEC 32) to tilde (ASCII DEC 126), which is the range of all printable ASCII characters.

If you want to strip everything outside that range (control characters and non-ASCII characters), you could use something like:

$someString.replace(/[^ -~]/g, '');

NOTE: this is JavaScript, not valid .NET code; it is an example of regexp usage for those who stumble upon this via search engines later.
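
A minimal sketch of both uses of the range, validating and stripping, in JavaScript. The function names isPrintableAscii and stripNonPrintableAscii are illustrative and not part of the answer; the ^ and $ anchors are added here so the whole string is checked rather than a single character.

```javascript
// Hypothetical helper names; the pattern is the [ -~] range described above.

// True when every character is printable ASCII (space through tilde).
// An empty string also passes, because * allows zero characters.
function isPrintableAscii(s) {
  return /^[ -~]*$/.test(s);
}

// Removes every character outside the printable ASCII range.
function stripNonPrintableAscii(s) {
  return s.replace(/[^ -~]/g, '');
}

console.log(isPrintableAscii('Hello, World!'));     // true
console.log(isPrintableAscii('Hello\u0007'));       // false (BEL)
console.log(stripNonPrintableAscii('a\tb\u00e9c')); // "abc"
```

Note that the strip variant also removes tabs, newlines, and every non-ASCII letter, which may be more aggressive than intended for international text.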

score 32 · Answer 2 · answered Aug 14 '09 at 08:50

If your regex flavor supports Unicode properties, this is probably the best way:

\P{Cc}

That matches any character that's not a control character, whether it be ASCII -- [\x00-\x1F\x7F] -- or Latin1 -- [\x80-\x9F] (also known as the C1 control characters).

The problem with POSIX classes like [:print:] or \p{Print} is that they can match different things depending on the regex flavor and, possibly, the locale settings of the underlying platform. In Java, they're strictly ASCII-oriented. That means \p{Print} matches only the ASCII printing characters -- [\x20-\x7E] -- while \P{Cntrl} (note the capital 'P') matches everything that's not an ASCII control character -- [^\x00-\x1F\x7F]. That is, it matches any ASCII character that isn't a control character, or any non-ASCII character--including C1 control characters.
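
As a rough illustration of \P{Cc} in a flavor that supports Unicode properties, here is a JavaScript sketch. The helper name hasNoControlChars is an assumption made for this example; JavaScript requires the u flag for \p{...} and \P{...}.

```javascript
// Hypothetical helper: true when the string contains no Cc (control) characters.
// The u flag is required for \p{...} / \P{...} in JavaScript.
function hasNoControlChars(s) {
  return /^\P{Cc}*$/u.test(s);
}

console.log(hasNoControlChars('na\u00efve caf\u00e9 \u2615')); // true
console.log(hasNoControlChars('bad\u0000'));                   // false (NUL, a C0 control)
console.log(hasNoControlChars('bad\u0085'));                   // false (NEL, a C1 control)
```

Tab, carriage return, and line feed are also in category Cc, so multi-line input fails this check unless those characters are explicitly allowed.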

HoldOffHunger · Answer 3 · 2021-06-30T15:18:44.937

TLDR Answer

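Use this Regex...

[^\p{Cc}\p{Cn}\p{Cs}]

The three properties must sit inside one character class. Written back to back as \P{Cc}\P{Cn}\P{Cs}, they would match three consecutive characters, each tested against only one category.

Working Demo

In this demo, I search the string "Hello, World!" with a weird character appended at the end, (char)4, which is the END OF TRANSMISSION character (U+0004). To report the offending character, the demo uses the inverse class, [\p{Cc}\p{Cn}\p{Cs}], which matches exactly the characters that the regex above rejects.

using System;
using System.Text.RegularExpressions;

public class Test {
    public static void Main() {
        // Match any single control, unassigned, or surrogate character.
        var regex = new Regex(@"[\p{Cc}\p{Cn}\p{Cs}]");
        // Append U+0004 as the only non-printable character in the input.
        var matches = regex.Matches("Hello, World!" + (char)4);
        Console.WriteLine("Results: " + matches.Count);
        foreach (Match match in matches) {
            // Print the code point, since the character itself is invisible.
            Console.WriteLine("Result: U+" + ((int)match.Value[0]).ToString("X4"));
        }
    }
}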

Full Working Demo at IDEOne.com

TLDR Explanation

\P{Cc} : Do not match control characters.
\P{Cn} : Do not match unassigned characters.
\P{Cs} : Do not match surrogate code points (which cannot appear in valid UTF-8).

Alternatives

\P{C} : Match any character outside the "Other" categories (control, format, surrogate, private-use, unassigned). Whitespace and separators still match.
\P{Cc} : Match only non-control characters. Do not match any control characters.
[^\p{Cc}\p{Cn}] : Match only non-control characters that have been assigned. Do not match any control or unassigned characters.
[^\p{Cc}\p{Cn}\p{Cs}] : Match only non-control characters that have been assigned and are not surrogates. Do not match any control, unassigned, or surrogate characters.
[^\p{Cc}\p{Cn}\p{Cs}\p{Cf}] : Match only non-control, non-formatting characters that have been assigned and are not surrogates. Do not match any control, unassigned, formatting, or surrogate characters.
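
A hedged sketch of combining several categories in a single character class, written in JavaScript with the u flag. The names NON_PRINTABLE, isPrintable, and findNonPrintable are invented for this example; U+FFFF is used as a sample of category Cn because noncharacters are permanently unassigned.

```javascript
// Hypothetical helpers built on one character class that lists the
// unwanted categories: control (Cc), unassigned (Cn), surrogate (Cs).
const NON_PRINTABLE = /[\p{Cc}\p{Cn}\p{Cs}]/u;
const NON_PRINTABLE_ALL = /[\p{Cc}\p{Cn}\p{Cs}]/gu;

// True when no character falls into any of the listed categories.
function isPrintable(s) {
  return !NON_PRINTABLE.test(s);
}

// Returns the offending code points formatted as U+XXXX.
function findNonPrintable(s) {
  return [...s.matchAll(NON_PRINTABLE_ALL)].map(
    m => 'U+' + m[0].codePointAt(0).toString(16).toUpperCase().padStart(4, '0')
  );
}

console.log(isPrintable('Hello, World!'));            // true
console.log(isPrintable('Hello, World!\u0004'));      // false
console.log(findNonPrintable('Hello, World!\u0004')); // [ 'U+0004' ]
```

Adding \p{Cf} to the class would also reject invisible formatting characters such as zero-width joiners, which some scripts and emoji sequences rely on.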

Source and Explanation

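Take a look at the Unicode Character Properties available that can be used to test within a regex. You should be able to use these regexes in Microsoft .NET, JavaScript (with the u flag), Python (via the third-party regex module, since the built-in re module does not support \p{...}), Java, PHP, Ruby, Perl, Golang, and even Adobe, although the exact set of supported categories varies by flavor. Knowing Unicode character classes is very transferable knowledge, so I recommend using it!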

All Matchable Unicode Character Sets

If you want to know any other character sets available, check out regular-expressions.info...

\p{L} or \p{Letter}: any kind of letter from any language.
- \p{Ll} or \p{Lowercase_Letter}: a lowercase letter that has an uppercase variant.
- \p{Lu} or \p{Uppercase_Letter}: an uppercase letter that has a lowercase variant.
- \p{Lt} or \p{Titlecase_Letter}: a letter that appears at the start of a word when only the first letter of the word is capitalized.
- \p{L&} or \p{Cased_Letter}: a letter that exists in lowercase and uppercase variants (combination of Ll, Lu and Lt).
- \p{Lm} or \p{Modifier_Letter}: a special character that is used like a letter.
- \p{Lo} or \p{Other_Letter}: a letter or ideograph that does not have lowercase and uppercase variants.
\p{M} or \p{Mark}: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).
- \p{Mn} or \p{Non_Spacing_Mark}: a character intended to be combined with another character without taking up extra space (e.g. accents, umlauts, etc.).
- \p{Mc} or \p{Spacing_Combining_Mark}: a character intended to be combined with another character that takes up extra space (vowel signs in many Eastern languages).
- \p{Me} or \p{Enclosing_Mark}: a character that encloses the character it is combined with (circle, square, keycap, etc.).
\p{Z} or \p{Separator}: any kind of whitespace or invisible separator.
- \p{Zs} or \p{Space_Separator}: a whitespace character that is invisible, but does take up space.
- \p{Zl} or \p{Line_Separator}: line separator character U+2028.
- \p{Zp} or \p{Paragraph_Separator}: paragraph separator character U+2029.
\p{S} or \p{Symbol}: math symbols, currency signs, dingbats, box-drawing characters, etc.
- \p{Sm} or \p{Math_Symbol}: any mathematical symbol.
- \p{Sc} or \p{Currency_Symbol}: any currency sign.
- \p{Sk} or \p{Modifier_Symbol}: a combining character (mark) as a full character on its own.
- \p{So} or \p{Other_Symbol}: various symbols that are not math symbols, currency signs, or combining characters.
\p{N} or \p{Number}: any kind of numeric character in any script.
- \p{Nd} or \p{Decimal_Digit_Number}: a digit zero through nine in any script except ideographic scripts.
- \p{Nl} or \p{Letter_Number}: a number that looks like a letter, such as a Roman numeral.
- \p{No} or \p{Other_Number}: a superscript or subscript digit, or a number that is not a digit 0–9 (excluding numbers from ideographic scripts).
\p{P} or \p{Punctuation}: any kind of punctuation character.
- \p{Pd} or \p{Dash_Punctuation}: any kind of hyphen or dash.
- \p{Ps} or \p{Open_Punctuation}: any kind of opening bracket.
- \p{Pe} or \p{Close_Punctuation}: any kind of closing bracket.
- \p{Pi} or \p{Initial_Punctuation}: any kind of opening quote.
- \p{Pf} or \p{Final_Punctuation}: any kind of closing quote.
- \p{Pc} or \p{Connector_Punctuation}: a punctuation character such as an underscore that connects words.
- \p{Po} or \p{Other_Punctuation}: any kind of punctuation character that is not a dash, bracket, quote or connector.
\p{C} or \p{Other}: invisible control characters and unused code points.
- \p{Cc} or \p{Control}: an ASCII or Latin-1 control character: 0x00–0x1F and 0x7F–0x9F.
- \p{Cf} or \p{Format}: invisible formatting indicator.
- \p{Co} or \p{Private_Use}: any code point reserved for private use.
- \p{Cs} or \p{Surrogate}: one half of a surrogate pair in UTF-16 encoding.
- \p{Cn} or \p{Unassigned}: any code point to which no character has been assigned.
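
To see the major categories from the list above in action, here is a small JavaScript sketch that reports which one a single character belongs to. The table MAJOR_CATEGORIES and the function majorCategory are hypothetical names used only for illustration.

```javascript
// Hypothetical lookup table: one pattern per major General_Category.
const MAJOR_CATEGORIES = [
  ['Letter', /^\p{L}$/u],
  ['Mark', /^\p{M}$/u],
  ['Separator', /^\p{Z}$/u],
  ['Symbol', /^\p{S}$/u],
  ['Number', /^\p{N}$/u],
  ['Punctuation', /^\p{P}$/u],
  ['Other', /^\p{C}$/u],
];

// Returns the major category name for a string holding one code point,
// or undefined when the input is not exactly one code point.
function majorCategory(ch) {
  const hit = MAJOR_CATEGORIES.find(([, re]) => re.test(ch));
  return hit ? hit[0] : undefined;
}

console.log(majorCategory('A'));      // Letter
console.log(majorCategory('7'));      // Number
console.log(majorCategory('\u20ac')); // Symbol (euro sign)
console.log(majorCategory('_'));      // Punctuation
console.log(majorCategory(' '));      // Separator
console.log(majorCategory('\u0004')); // Other
```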

score 16 · Accepted Answer · edited Jan 20 '18 at 02:11

There is a POSIX character class designation [:print:] that should match printable characters, and [:cntrl:] for control characters. Note that these match codes throughout the ASCII table, so they might not be suitable for matching other encodings.

Failing that, the expression [\x00-\x1f] will match the ASCII control characters (all except DEL, \x7f), although again, these could be printable in other encodings.
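
For flavors without POSIX bracket classes (JavaScript is one), the range can be spelled out explicitly. The following sketch also includes DEL (0x7F); the helper name hasAsciiControl is an assumption for this example.

```javascript
// Hypothetical helper: detects ASCII control characters,
// i.e. 0x00 through 0x1F plus DEL (0x7F).
function hasAsciiControl(s) {
  return /[\x00-\x1F\x7F]/.test(s);
}

console.log(hasAsciiControl('abc'));      // false
console.log(hasAsciiControl('a\x7Fb'));   // true (DEL)
console.log(hasAsciiControl('a\nb'));     // true (line feed)
console.log(hasAsciiControl('\u00e9'));   // false (non-ASCII is not covered)
```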

edited by Joshua Dawson
answered Aug 08 '09 at 02:10 by zombat

seems like the second misses `\x7f` (DEL) – ps2goat Jun 11 '18 at 17:50

score 3 · Answer 5 · answered Aug 08 '09 at 06:43

In Java, the \p{Print} option specifies the printable character class.

answered by hashable

score 1 · Answer 6 · answered Aug 08 '09 at 02:52

It depends wildly on what regex package you are using. This is one of those situations about which some wag said that the great thing about standards is there are so many to choose from.

If you happen to be using C, the isprint(3) function/macro is your friend.

score 1 · Answer 7 · edited Sep 15 '17 at 11:04

Adding on to @Alan-Moore, \P{Cc} is actually an example of a negated Unicode category or Unicode block (ref: Character Classes in Regular Expressions). \P{name} matches any character that does not belong to the named Unicode general category or named block. See the referred link for more examples of named blocks supported in .NET.

edited by Jonathan Sayce
answered Jan 08 '16 at 20:38 by Adarsha