18

I am looking for a regular expression that validates all printable characters. The regex needs to work in JavaScript only. I have gone through this post, but it mostly covers .NET, Java, and C, not JavaScript.

Only these printable characters should be allowed:

a-z, A-Z, 0-9, and the thirty-two symbols: !"#$%&'()*+,-./:;<=>?@[] ^_`{|}~ and space

I need a JavaScript regex that checks that every input character is one of the above and rejects the rest.

Brian Tompsett - 汤莱恩
AurA
  • All? Are you sure? Are you aware of just how many unicode characters there are? – Ariel Aug 21 '12 at 10:21
  • Unfortunately javascript does not support unicode character classes: http://stackoverflow.com/questions/280712/javascript-unicode – Ariel Aug 21 '12 at 10:23
  • Unicode UTF-16 has some 2^16 I guess. – AurA Aug 21 '12 at 10:25
  • @AurA: Not even close. You definitely need to read Joel's [Unicode article](http://www.joelonsoftware.com/articles/Unicode.html) before venturing any further into this. – Tim Pietzcker Aug 21 '12 at 10:54

5 Answers

17

If you want to match all printable characters in the UTF-8 set (as indicated by your comment on Aug 21), you're going to have a hard time doing this yourself. JavaScript's native regexes have abysmal Unicode support. But you can use XRegExp with the regex ^\P{C}*$.

If you only want to match those few ASCII letters you mentioned in the edit to your post from Aug 22, then the regex is trivial:

/^[a-z0-9!"#$%&'()*+,.\/:;<=>?@\[\] ^_`{|}~-]*$/i
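A minimal sketch of how the whitelist above could be used, both to validate a whole string and to discard everything else. The helper names isAllowed and stripDisallowed are hypothetical, chosen only for illustration; the character class is copied from the regex above.

```javascript
// Whitelist from the regex above: letters, digits, space, and the listed symbols.
// The i flag covers A-Z; the trailing "-" inside the class is a literal hyphen.
const allowedChars = /^[a-z0-9!"#$%&'()*+,.\/:;<=>?@\[\] ^_`{|}~-]*$/i;

// Same class, negated and global, to find every character that is NOT allowed.
const disallowedChars = /[^a-z0-9!"#$%&'()*+,.\/:;<=>?@\[\] ^_`{|}~-]/gi;

// Hypothetical helper: true when every character of `input` is in the whitelist.
function isAllowed(input) {
  return allowedChars.test(input);
}

// Hypothetical helper: drop every character that is not in the whitelist.
function stripDisallowed(input) {
  return input.replace(disallowedChars, "");
}

console.log(isAllowed("Hello, World! (v2.0)"));            // true
console.log(isAllowed("tab\there"));                       // false (tab is a control char)
console.log(isAllowed("na\u00EFve"));                      // false (ï is outside ASCII)
console.log(stripDisallowed("na\u00EFve\tcaf\u00E9"));     // "navecaf"
```

Note that the class as written does not contain the backslash, so a string such as "a\\b" is rejected; add \\ inside the brackets if the backslash should be allowed.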
Tim Pietzcker
  • that I guess is a good solution but suppose I want printable characters for UTF-8 only, can you get me a regular expression without using any third party JavaScript library. – AurA Aug 21 '12 at 10:30
  • @AurA: XRegExp compiles down to native JavaScript. – Tim Pietzcker Aug 21 '12 at 10:45
  • I already know that but I have that restriction here... that I cannot use a third party library. That is why I am asking for UTF-8 only, that would reduce that number of characters drastically and can be handled with regex. – AurA Aug 21 '12 at 10:48
  • UTF-8 has *EXACTLY* the same number of characters as UTF-16 and UTF-32. UTF-8 is just an encoding - it has ALL of unicode - the entire thing. Did you mean ASCII? – Ariel Aug 21 '12 at 10:50
  • http://en.wikipedia.org/wiki/UTF-8 Out of these given characters I want to check if the entered string has any unprintable character or on keypress I want to check if entered character is printable. – AurA Aug 21 '12 at 11:16
13

For non-Unicode input, use the regex pattern ^[^\x00-\x1F\x7F-\x9F]+$ (the range starts at \x7F so that DEL, which is also a control character, is rejected).


If you want to work with unicode, first read Javascript + Unicode regexes.

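I would then suggest the regex pattern ^[^\p{Cc}\p{Cf}\p{Zl}\p{Zp}]*$ (in native JavaScript, \p{...} is only recognized when the regex has the u flag, available since ES2018).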

  • \p{Cc} or \p{Control}: an ASCII 0x00..0x1F, 0x7F, or Latin-1 0x80..0x9F control character.
  • \p{Cf} or \p{Format}: invisible formatting indicator.
  • \p{Zl} or \p{Line_Separator}: line separator character U+2028.
  • \p{Zp} or \p{Paragraph_Separator}: paragraph separator character U+2029.

For more information see http://www.regular-expressions.info/unicode.html
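As a rough sketch, the Unicode pattern above can be tried in native JavaScript, assuming an engine with ES2018 property escapes (the u flag). The constant name noControlOrFormat is made up for this example.

```javascript
// \p{...} is only recognized with the u flag; without it, \p is just an escaped "p".
const noControlOrFormat = /^[^\p{Cc}\p{Cf}\p{Zl}\p{Zp}]*$/u;

console.log(noControlOrFormat.test("dem\u00E1siado tarde")); // true
console.log(noControlOrFormat.test("bell\u0007"));           // false, U+0007 is Cc
console.log(noControlOrFormat.test("soft\u00ADhyphen"));     // false, U+00AD is Cf
console.log(noControlOrFormat.test("line\u2028break"));      // false, U+2028 is Zl
console.log(noControlOrFormat.test(""));                     // true, because of the * quantifier
```

Swap the * for a + if the empty string should be rejected.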

avetisk
Ωmega
13

To validate that a string consists only of printable ASCII characters, use a simple regex like

/^[ -~]+$/

It matches

  • ^ - the start of string anchor
  • [ -~]+ - one or more (due to the + quantifier) characters in the ASCII range from space (0x20) to tilde (0x7E)
  • $ - the end of string anchor

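For Unicode printable chars, use the \PC Unicode category class from XRegExp, as has already been mentioned. It matches any char outside the "Other" category C, which covers control, format, surrogate, private-use, and unassigned code points: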

^\PC+$

See regex demos:

// ASCII only
var ascii_print_rx = /^[ -~]+$/;
console.log(ascii_print_rx.test("It's all right.")); // true
console.log(ascii_print_rx.test('\f ')); // false, \f is an ASCII form feed char
console.log(ascii_print_rx.test("demásiado tarde")); // false, no Unicode printable char support
// Unicode support
console.log(XRegExp.test('demásiado tarde', XRegExp("^\\PC+$"))); // true
console.log(XRegExp.test('\u200C ', XRegExp("^\\PC+$"))); // false, \u200C is a Unicode zero-width non-joiner
console.log(XRegExp.test('\f ', XRegExp("^\\PC+$"))); // false, \f is an ASCII form feed char
<script src="http://cdnjs.cloudflare.com/ajax/libs/xregexp/3.1.1/xregexp-all.min.js"></script>
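For environments where a library is not an option, the same check can probably be written without XRegExp. This is a sketch that assumes an ES2018+ engine, where native regexes accept \P{C} together with the u flag; the variable name unicode_print_rx is invented to mirror ascii_print_rx above.

```javascript
// Native counterpart of XRegExp("^\\PC+$"); requires the u flag.
var unicode_print_rx = /^\P{C}+$/u;

console.log(unicode_print_rx.test("dem\u00E1siado tarde")); // true
console.log(unicode_print_rx.test("\u200C "));              // false, U+200C is a format char (Cf)
console.log(unicode_print_rx.test("\f "));                  // false, \f is an ASCII form feed char
```

Which code points count as unassigned depends on the Unicode version built into the engine, so results for very recently added characters may differ between runtimes.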
Graham
Wiktor Stribiżew
11

Looks like JavaScript has changed to some degree since this question was posted?

I'm using this one:

var regex = /^[\u0020-\u007e\u00a0-\u00ff]*$/;
console.log( regex.test("!\"#$%&'()*+,-./:;<=>?@[] ^_`{|}~")); //should output "true" 
console.log( regex.test("Iñtërnâtiônàlizætiøn")); //should output "true"
console.log( regex.test("☃")); //should output "false" 
RevelationX
6

TLDR Answer

Use string1.match(/[\p{Cc}\p{Cn}\p{Cs}]+/gu) as a conditional, with a truthy result meaning that string1 contains at least one unprintable character.

Or, if you want the inverse check, string1.match(/^[^\p{Cc}\p{Cn}\p{Cs}]+$/gu) as a conditional will be truthy if string1 contains only printable characters.

TLDR Explanation

  • \p{Cc} : Control characters.
  • \p{Cn} : Unassigned code points.
  • \p{Cs} : Surrogate code points (lone UTF-16 surrogates, which cannot appear in valid UTF-8).
  • [^...] : Negated class, so none of the categories above may appear. Listing \P{Cc}\P{Cn}\P{Cs} inside a plain class would not work: the escapes are OR-ed together, and every character is outside at least one of the three categories, so that class matches everything.
  • + : Make sure that something is found, i.e., this will also mean that "", the blank string, will not be considered printable.
  • /g : Global flag; match() then returns every match instead of only the first. It is not needed for a simple yes/no check.
  • /u : The Unicode flag, which enables \p{...} property escapes and matching by code point. (Source: MDN Web Docs: Regular Expressions; Unicode Property Escapes.)

Demo

var string1 = 'This string has unprintable characters \u0001';

if(string1.match(/[\p{Cc}\p{Cn}\p{Cs}]+/gu)) {
  console.log("Unprintable string: " + string1);
}
var string2 = 'This string has only printable characters.';

if(string2.match(/^[^\p{Cc}\p{Cn}\p{Cs}]+$/gu)) {
  console.log("Printable string: " + string2);
}
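One pitfall is worth checking when writing the "only printable" form. Inside a character class, several \P{...} escapes are OR-ed together, so a class built from \P{Cc}, \P{Cn}, and \P{Cs} accepts every character; a negated class over the lowercase \p{...} escapes is what actually excludes the three categories. The sketch below compares both (variable names are illustrative only).

```javascript
// Union of negations: a character passes if it is outside ANY one of the categories,
// which is true for every character.
const unionOfNegations = /^[\P{Cc}\P{Cn}\P{Cs}]+$/u;

// Negated class: a character passes only if it is outside ALL of the categories.
const negatedClass = /^[^\p{Cc}\p{Cn}\p{Cs}]+$/u;

const withControlChar = "This string has unprintable characters \u0001";

console.log(unionOfNegations.test(withControlChar)); // true  (control char slips through)
console.log(negatedClass.test(withControlChar));     // false (control char is rejected)
console.log(negatedClass.test("This string has only printable characters.")); // true
```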

Possible Alternatives

  • \P{C} : Match only characters outside the "Other" category. Do not match any control, format, surrogate, private-use, or unassigned code points.
  • \P{Cc} : Match only non-control characters. Do not match any control characters.
  • [^\p{Cc}\p{Cn}] : Match only non-control characters that have been assigned. Do not match any control or unassigned characters.
  • [^\p{Cc}\p{Cn}\p{Cs}] : Match only non-control characters that have been assigned and are not surrogates. Do not match any control, unassigned, or surrogate (UTF-8-invalid) code points.
  • [^\p{Cc}\p{Cn}\p{Cs}\p{Cf}] : Match only non-control, non-formatting characters that have been assigned and are not surrogates. Do not match any control, unassigned, formatting, or surrogate (UTF-8-invalid) code points.
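To see how these alternatives differ in practice, here is a small sketch that runs one sample per category through each of them. It assumes an ES2018+ engine; the checks object and the acceptedBy helper are hypothetical names, and each class is written in negated form so that all listed categories are excluded together.

```javascript
// Whole-string tests, from most to least restrictive about the "Other" categories.
const checks = {
  notC:        /^\P{C}+$/u,
  notCc:       /^\P{Cc}+$/u,
  notCcCn:     /^[^\p{Cc}\p{Cn}]+$/u,
  notCcCnCs:   /^[^\p{Cc}\p{Cn}\p{Cs}]+$/u,
  notCcCnCsCf: /^[^\p{Cc}\p{Cn}\p{Cs}\p{Cf}]+$/u,
};

// Hypothetical helper: names of the checks that accept `text`.
function acceptedBy(text) {
  return Object.keys(checks).filter((name) => checks[name].test(text));
}

console.log(acceptedBy("plain text")); // all five names
console.log(acceptedBy("a\u0007"));    // []  (U+0007 is a control char, Cc)
console.log(acceptedBy("a\u200Db"));   // notCc, notCcCn, notCcCnCs  (U+200D is Cf)
console.log(acceptedBy("a\uFFFF"));    // notCc  (U+FFFF is a noncharacter, Cn)
console.log(acceptedBy("a\uE000"));    // everything except notC  (U+E000 is private use, Co)
```

Only \P{C} rejects private-use characters, and only the last variant rejects formatting characters such as the zero-width joiner, which is the main trade-off between the options.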

Source and Explanation

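Take a look at the Unicode Character Properties available that can be used to test within a regex. You should be able to use these regexes in Microsoft .NET, JavaScript, Java, PHP, Ruby, Perl, Golang, and even Adobe, though the exact syntax and the supported names vary by engine (Python's built-in re module, for instance, does not support \p{...}; the third-party regex module does). Knowing Unicode character classes is very transferable knowledge, so I recommend using it!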

This character class will match anything visible (plus separators such as spaces), given in both its short-hand and long-hand form...

[\p{L}\p{M}\p{N}\p{P}\p{S}\p{Z}]
[\p{Letter}\p{Mark}\p{Number}\p{Punctuation}\p{Symbol}\p{Separator}]
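A quick sketch of this whitelist approach in native JavaScript, assuming ES2018+ and wrapping the six categories in one bracketed character class with anchors (the constant names are illustrative). Since every code point belongs to exactly one of the seven major categories L, M, N, P, S, Z, and C, the class should behave the same as \P{C}.

```javascript
// Short-hand and long-hand spellings of the same whitelist class.
const visibleShort = /^[\p{L}\p{M}\p{N}\p{P}\p{S}\p{Z}]+$/u;
const visibleLong =
  /^[\p{Letter}\p{Mark}\p{Number}\p{Punctuation}\p{Symbol}\p{Separator}]+$/u;

const sample = "I\u00F1t\u00EBrn\u00E2ti\u00F4n\u00E0liz\u00E6ti\u00F8n \u2603 42!";

console.log(visibleShort.test(sample));        // true
console.log(visibleLong.test(sample));         // true
console.log(visibleShort.test("tab\there"));   // false, tab is a control char (Cc)
console.log(visibleShort.test("\u{1F468}\u200D\u{1F469}")); // false, U+200D is a format char (Cf)
```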

\p indicates a category we want to match, while the capitalized \P matches everything outside that category. That means we can negate the \p{C} class, used for "invisible control characters and unused code points." (Source: Regular-Expressions.info.) A simpler regex, then, would be \P{C}, but this might be too restrictive because it also rejects invisible formatting characters (\p{Cf}), such as the zero-width joiner used in emoji sequences. Look closely at what your input needs; one of the alternatives should fit.

All Matchable Unicode Character Sets

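If you want to know which other character sets are available, check out regular-expressions.info. The list below follows that site's naming; engines differ on a few of the long names (JavaScript, for example, expects \p{LC}, \p{Nonspacing_Mark}, \p{Spacing_Mark}, and \p{Decimal_Number}), so the two-letter forms are the most portable.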

  • \p{L} or \p{Letter}: any kind of letter from any language.
    • \p{Ll} or \p{Lowercase_Letter}: a lowercase letter that has an uppercase variant.
    • \p{Lu} or \p{Uppercase_Letter}: an uppercase letter that has a lowercase variant.
    • \p{Lt} or \p{Titlecase_Letter}: a letter that appears at the start of a word when only the first letter of the word is capitalized.
    • \p{L&} or \p{Cased_Letter}: a letter that exists in lowercase and uppercase variants (combination of Ll, Lu and Lt).
    • \p{Lm} or \p{Modifier_Letter}: a special character that is used like a letter.
    • \p{Lo} or \p{Other_Letter}: a letter or ideograph that does not have lowercase and uppercase variants.
  • \p{M} or \p{Mark}: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).
    • \p{Mn} or \p{Non_Spacing_Mark}: a character intended to be combined with another character without taking up extra space (e.g. accents, umlauts, etc.).
    • \p{Mc} or \p{Spacing_Combining_Mark}: a character intended to be combined with another character that takes up extra space (vowel signs in many Eastern languages).
    • \p{Me} or \p{Enclosing_Mark}: a character that encloses the character it is combined with (circle, square, keycap, etc.).
  • \p{Z} or \p{Separator}: any kind of whitespace or invisible separator.
    • \p{Zs} or \p{Space_Separator}: a whitespace character that is invisible, but does take up space.
    • \p{Zl} or \p{Line_Separator}: line separator character U+2028.
    • \p{Zp} or \p{Paragraph_Separator}: paragraph separator character U+2029.
  • \p{S} or \p{Symbol}: math symbols, currency signs, dingbats, box-drawing characters, etc.
    • \p{Sm} or \p{Math_Symbol}: any mathematical symbol.
    • \p{Sc} or \p{Currency_Symbol}: any currency sign.
    • \p{Sk} or \p{Modifier_Symbol}: a combining character (mark) as a full character on its own.
    • \p{So} or \p{Other_Symbol}: various symbols that are not math symbols, currency signs, or combining characters.
  • \p{N} or \p{Number}: any kind of numeric character in any script.
    • \p{Nd} or \p{Decimal_Digit_Number}: a digit zero through nine in any script except ideographic scripts.
    • \p{Nl} or \p{Letter_Number}: a number that looks like a letter, such as a Roman numeral.
    • \p{No} or \p{Other_Number}: a superscript or subscript digit, or a number that is not a digit 0–9 (excluding numbers from ideographic scripts).
  • \p{P} or \p{Punctuation}: any kind of punctuation character.
    • \p{Pd} or \p{Dash_Punctuation}: any kind of hyphen or dash.
    • \p{Ps} or \p{Open_Punctuation}: any kind of opening bracket.
    • \p{Pe} or \p{Close_Punctuation}: any kind of closing bracket.
    • \p{Pi} or \p{Initial_Punctuation}: any kind of opening quote.
    • \p{Pf} or \p{Final_Punctuation}: any kind of closing quote.
    • \p{Pc} or \p{Connector_Punctuation}: a punctuation character such as an underscore that connects words.
    • \p{Po} or \p{Other_Punctuation}: any kind of punctuation character that is not a dash, bracket, quote or connector.
  • \p{C} or \p{Other}: invisible control characters and unused code points.
    • \p{Cc} or \p{Control}: an ASCII or Latin-1 control character: 0x00–0x1F and 0x7F–0x9F.
    • \p{Cf} or \p{Format}: invisible formatting indicator.
    • \p{Co} or \p{Private_Use}: any code point reserved for private use.
    • \p{Cs} or \p{Surrogate}: one half of a surrogate pair in UTF-16 encoding.
    • \p{Cn} or \p{Unassigned}: any code point to which no character has been assigned.
HoldOffHunger