TLDR Answer
Use string1.match(/[\p{Cc}\p{Cn}\p{Cs}]+/gu) as a conditional: it returns an array (truthy) if string1 contains any unprintable characters, and null (falsy) otherwise.
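Or, if you want the logical complement, string1.match(/^[^\p{Cc}\p{Cn}\p{Cs}]+$/gu) as a conditional is truthy only if string1 is non-empty and contains only printable characters. Note the negated class [^...]: a class written as [\P{Cc}\P{Cn}\P{Cs}] is the union of the three negations, which matches every character.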
TLDR Explanation
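\p{Cc}
: Control characters. (\P{Cc}, with a capital P, matches everything that is not a control character.)
\p{Cn}
: Unassigned code points.
\p{Cs}
: Surrogate code points. With the u flag these are lone (unpaired) UTF-16 surrogates, which cannot be encoded as valid UTF-8.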
+
: Require at least one character, so the empty string "" is not considered printable.
/g
: Global flag (not "greedy"). It makes match() return every match in the string instead of stopping at the first one. It is not needed for a simple yes/no check.
/u
: Unicode flag. It makes the regex operate on Unicode code points and is required for \p{...} property escapes. (Source: MDN Web Docs: Regular Expressions; Unicode Property Escapes.)
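To make the two flags concrete, here is a minimal sketch of how they behave. The variable names are illustrative only and are not part of the answer above.

```javascript
// Without the u flag, \p{Cc} is not a property escape: it matches the literal text "p{Cc}".
const withUnicodeFlag = /\p{Cc}/u;
const withoutUnicodeFlag = /\p{Cc}/;

const controlMatchedWithU = withUnicodeFlag.test("\u0001");       // control character is found
const controlMatchedWithoutU = withoutUnicodeFlag.test("\u0001"); // control character is not found
const literalMatchedWithoutU = withoutUnicodeFlag.test("p{Cc}");  // the literal text is found

// The g flag makes a regex remember where its last match ended (lastIndex).
// match() resets that position, but test() does not, so a reused g regex
// can give different answers for the same input.
const globalRegex = /[\p{Cc}\p{Cn}\p{Cs}]/gu;
const firstCall = globalRegex.test("a\u0001");  // matches at index 1, lastIndex becomes 2
const secondCall = globalRegex.test("a\u0001"); // search resumes at index 2 and finds nothing

console.log({ controlMatchedWithU, controlMatchedWithoutU, literalMatchedWithoutU });
console.log({ firstCall, secondCall });
```

If you prefer test() over match() for a yes/no check, dropping the g flag avoids the lastIndex surprise.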
Demo
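// A string that ends with a control character (U+0001).
var string1 = 'This string has unprintable characters \u0001';
// match() returns an array (truthy) on a match and null (falsy) otherwise.
if (string1.match(/[\p{Cc}\p{Cn}\p{Cs}]+/gu)) {
    console.log("Unprintable string: " + string1);
}
var string2 = 'This string has only printable characters.';
// The negated class [^...] accepts only characters outside Cc, Cn, and Cs.
if (string2.match(/^[^\p{Cc}\p{Cn}\p{Cs}]+$/gu)) {
    console.log("Printable string: " + string2);
}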
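If you need these checks in more than one place, they can be wrapped in small helpers. The following is only a sketch: hasUnprintable, isPrintable, and stripUnprintable are hypothetical names chosen for illustration, and it uses test() without the g flag instead of match().

```javascript
// Hypothetical helpers built on the Cc/Cn/Cs categories.
const UNPRINTABLE = /[\p{Cc}\p{Cn}\p{Cs}]/u;
const ONLY_PRINTABLE = /^[^\p{Cc}\p{Cn}\p{Cs}]+$/u;

// True if the string contains at least one control, unassigned, or surrogate code point.
function hasUnprintable(text) {
  return UNPRINTABLE.test(text);
}

// True if the string is non-empty and contains none of those code points.
function isPrintable(text) {
  return ONLY_PRINTABLE.test(text);
}

// Returns a copy of the string with every such code point removed.
function stripUnprintable(text) {
  return text.replace(/[\p{Cc}\p{Cn}\p{Cs}]/gu, "");
}

console.log(hasUnprintable("abc\u0001"));
console.log(isPrintable("abc"));
console.log(isPrintable("line one\nline two")); // newline (U+000A) is a control character
console.log(JSON.stringify(stripUnprintable("a\u0001b\uFFFFc")));
```

Keep in mind that tab, carriage return, and newline are all in \p{Cc}, so multi-line text counts as unprintable under this definition.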
Possible Alternatives
\P{C}
: Match any character outside the "Other" category. Do not match control, format, surrogate, private-use, or unassigned code points. Whitespace and separators (\p{Z}) are still matched.
[^\p{Cc}]
: Match only non-control characters. Do not match any control characters. Equivalent to \P{Cc}.
[^\p{Cc}\p{Cn}]
: Match only non-control characters that have been assigned. Do not match any control or unassigned characters.
[^\p{Cc}\p{Cn}\p{Cs}]
: Match only non-control characters that have been assigned and are not surrogates (lone surrogates cannot be encoded as valid UTF-8). Do not match any control, unassigned, or surrogate characters.
[^\p{Cc}\p{Cn}\p{Cs}\p{Cf}]
: Match only non-control, non-formatting characters that have been assigned and are not surrogates. Do not match any control, unassigned, formatting, or surrogate characters.
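To see how the alternatives differ, the sketch below runs each one against a sample string per category. The sample characters and the names alternatives, samples, and results are my own choices for illustration; each pattern is written as a negated character class anchored to the whole string.

```javascript
// Each alternative, written as "the whole string contains none of these categories".
const alternatives = {
  "C":           /^[^\p{C}]+$/u, // same as /^\P{C}+$/u
  "Cc":          /^[^\p{Cc}]+$/u,
  "Cc,Cn":       /^[^\p{Cc}\p{Cn}]+$/u,
  "Cc,Cn,Cs":    /^[^\p{Cc}\p{Cn}\p{Cs}]+$/u,
  "Cc,Cn,Cs,Cf": /^[^\p{Cc}\p{Cn}\p{Cs}\p{Cf}]+$/u,
};

// One sample per category of interest.
const samples = {
  "plain text":                    "abc",
  "control U+0001 (Cc)":           "a\u0001",
  "noncharacter U+FFFF (Cn)":      "a\uFFFF",
  "lone surrogate U+D800 (Cs)":    "a\uD800",
  "zero-width joiner U+200D (Cf)": "a\u200D",
  "private use U+E000 (Co)":       "a\uE000",
};

// results[excluded][sample] is true when the sample passes that alternative.
const results = {};
for (const [excluded, regex] of Object.entries(alternatives)) {
  results[excluded] = {};
  for (const [label, text] of Object.entries(samples)) {
    results[excluded][label] = regex.test(text);
  }
}

console.table(results);
```

The stricter the excluded set, the more samples are rejected; only \P{C} rejects private-use code points as well.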
Source and Explanation
Take a look at the Unicode Character Properties that can be tested within a regex. Similar property escapes are available in Microsoft .NET, JavaScript, Java, PHP, Ruby, Perl, Golang, and even Adobe, and in Python through the third-party regex module (the built-in re module does not support \p{...}). The exact syntax and the set of supported property names vary between engines. Knowing Unicode character classes is very transferable knowledge, so I recommend using it!
This character class matches any single character outside the "Other" category, i.e. anything visible plus whitespace and separators, given in both its short-hand and long-hand form:
[\p{L}\p{M}\p{N}\p{P}\p{S}\p{Z}]
[\p{Letter}\p{Mark}\p{Number}\p{Punctuation}\p{Symbol}\p{Separator}]
\p indicates a category we want to match, and \P (capitalized) indicates a category we do not want to match. That means we can use the \p{C} class, which covers "invisible control characters and unused code points." (Source: Regular-Expressions.info.) A simpler regex then would be \P{C}, but this might be too restrictive because it also rejects invisible formatting characters (\p{Cf}), such as the zero-width joiner. Look closely at what your input needs; one of the alternatives should fit.
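The relationship between \p, \P, and negated character classes can be checked directly. This is a small sketch with illustrative variable names; it relies on the fact that the seven top-level categories (L, M, N, P, S, Z, C) cover every code point.

```javascript
// Listing the six non-"Other" categories is equivalent to excluding \p{C}.
const visibleCategories = /^[\p{L}\p{M}\p{N}\p{P}\p{S}\p{Z}]+$/u;
const notOther = /^\P{C}+$/u;

const textSample = "Hello, world 1+1=2";
const formatSample = "a\u200D"; // contains a zero-width joiner (Cf)

const bothAcceptText = visibleCategories.test(textSample) && notOther.test(textSample);
const bothRejectFormat = !visibleCategories.test(formatSample) && !notOther.test(formatSample);

// Pitfall: several \P{...} escapes inside one class form a union of negations.
// U+0001 is a control character, but it is "not unassigned", so \P{Cn} lets it through.
const unionOfNegations = /^[\P{Cc}\P{Cn}]+$/u;
const negatedUnion = /^[^\p{Cc}\p{Cn}]+$/u;

const unionAcceptsControl = unionOfNegations.test("\u0001");
const negatedRejectsControl = !negatedUnion.test("\u0001");

console.log({ bothAcceptText, bothRejectFormat, unionAcceptsControl, negatedRejectsControl });
```

To exclude several categories at once, put the lowercase \p{...} escapes inside a negated class, [^...], and avoid combining uppercase \P{...} escapes.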
All Matchable Unicode Character Sets
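If you want to know the other character sets available, check out regular-expressions.info. The names below follow that site. A few long forms are spelled differently in JavaScript: for example, it uses \p{LC} instead of \p{L&}, \p{Nonspacing_Mark}, \p{Spacing_Mark}, and \p{Decimal_Number}. The two-letter short forms are the safer choice across engines.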
\p{L}
or \p{Letter}
: any kind of letter from any language.
\p{Ll}
or \p{Lowercase_Letter}
: a lowercase letter that has an uppercase variant.
\p{Lu}
or \p{Uppercase_Letter}
: an uppercase letter that has a lowercase variant.
\p{Lt}
or \p{Titlecase_Letter}
: a letter that appears at the start of a word when only the first letter of the word is capitalized.
\p{L&}
or \p{Cased_Letter}
: a letter that exists in lowercase and uppercase variants (combination of Ll, Lu and Lt).
\p{Lm}
or \p{Modifier_Letter}
: a special character that is used like a letter.
\p{Lo}
or \p{Other_Letter}
: a letter or ideograph that does not have lowercase and uppercase variants.
\p{M}
or \p{Mark}
: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).
\p{Mn}
or \p{Non_Spacing_Mark}
: a character intended to be combined with another character without taking up extra space (e.g. accents, umlauts, etc.).
\p{Mc}
or \p{Spacing_Combining_Mark}
: a character intended to be combined with another character that takes up extra space (vowel signs in many Eastern languages).
\p{Me}
or \p{Enclosing_Mark}
: a character that encloses the character it is combined with (circle, square, keycap, etc.).
\p{Z}
or \p{Separator}
: any kind of whitespace or invisible separator.
\p{Zs}
or \p{Space_Separator}
: a whitespace character that is invisible, but does take up space.
\p{Zl}
or \p{Line_Separator}
: line separator character U+2028.
\p{Zp}
or \p{Paragraph_Separator}
: paragraph separator character U+2029.
\p{S}
or \p{Symbol}
: math symbols, currency signs, dingbats, box-drawing characters, etc.
\p{Sm}
or \p{Math_Symbol}
: any mathematical symbol.
\p{Sc}
or \p{Currency_Symbol}
: any currency sign.
\p{Sk}
or \p{Modifier_Symbol}
: a combining character (mark) as a full character on its own.
\p{So}
or \p{Other_Symbol}
: various symbols that are not math symbols, currency signs, or combining characters.
\p{N}
or \p{Number}
: any kind of numeric character in any script.
\p{Nd}
or \p{Decimal_Digit_Number}
: a digit zero through nine in any script except ideographic scripts.
\p{Nl}
or \p{Letter_Number}
: a number that looks like a letter, such as a Roman numeral.
\p{No}
or \p{Other_Number}
: a superscript or subscript digit, or a number that is not a digit 0–9 (excluding numbers from ideographic scripts).
\p{P}
or \p{Punctuation}
: any kind of punctuation character.
\p{Pd}
or \p{Dash_Punctuation}
: any kind of hyphen or dash.
\p{Ps}
or \p{Open_Punctuation}
: any kind of opening bracket.
\p{Pe}
or \p{Close_Punctuation}
: any kind of closing bracket.
\p{Pi}
or \p{Initial_Punctuation}
: any kind of opening quote.
\p{Pf}
or \p{Final_Punctuation}
: any kind of closing quote.
\p{Pc}
or \p{Connector_Punctuation}
: a punctuation character such as an underscore that connects words.
\p{Po}
or \p{Other_Punctuation}
: any kind of punctuation character that is not a dash, bracket, quote or connector.
\p{C}
or \p{Other}
: invisible control characters and unused code points.
\p{Cc}
or \p{Control}
: an ASCII or Latin-1 control character: 0x00–0x1F and 0x7F–0x9F.
\p{Cf}
or \p{Format}
: invisible formatting indicator.
\p{Co}
or \p{Private_Use}
: any code point reserved for private use.
\p{Cs}
or \p{Surrogate}
: one half of a surrogate pair in UTF-16 encoding.
\p{Cn}
or \p{Unassigned}
: any code point to which no character has been assigned.
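When it is unclear which of the categories above a character falls into, you can test it against each two-letter code in turn. The sketch below does that; generalCategory, CATEGORY_CODES, and CATEGORY_TESTS are hypothetical names, and the result for unassigned code points depends on the Unicode version your JavaScript engine ships with.

```javascript
// The 30 two-letter General_Category codes from the list above.
const CATEGORY_CODES = [
  "Lu", "Ll", "Lt", "Lm", "Lo",
  "Mn", "Mc", "Me",
  "Nd", "Nl", "No",
  "Pc", "Pd", "Ps", "Pe", "Pi", "Pf", "Po",
  "Sm", "Sc", "Sk", "So",
  "Zs", "Zl", "Zp",
  "Cc", "Cf", "Cs", "Co", "Cn",
];

// Build one anchored regex per code, e.g. /^\p{Lu}$/u.
const CATEGORY_TESTS = CATEGORY_CODES.map(
  (code) => [code, new RegExp("^\\p{" + code + "}$", "u")]
);

// Returns the category code of the first code point in text, or null for an empty string.
function generalCategory(text) {
  const codePoint = text.codePointAt(0);
  if (codePoint === undefined) {
    return null;
  }
  const firstCharacter = String.fromCodePoint(codePoint);
  for (const [code, regex] of CATEGORY_TESTS) {
    if (regex.test(firstCharacter)) {
      return code;
    }
  }
  return null;
}

for (const sample of ["A", "a", "1", " ", "_", "\u20AC", "\u0001", "\u2028"]) {
  console.log(JSON.stringify(sample), generalCategory(sample));
}
```

Every code point belongs to exactly one of these categories, so the loop always finds a match for a non-empty string.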