76

How can I remove control characters like STX from a PHP string? I played around with

preg_replace("/[^a-zA-Z0-9 .\-_;!:?äÄöÖüÜß<>='\"]/","",$pString)

but found that it removed way too much. Is there a way to remove only control characters?

halfer
KB22
  • The following links might help you :
    [ASCII Characters Table](http://web.cs.mun.ca/~michael/c/ascii-table.html)
    [POSIX reference](http://www.regular-expressions.info/posixbrackets.html)
    [Regular expressions](http://w3.pppl.gov/info/grep/Regular_Expressions.html)
    – Rohutech Aug 20 '11 at 09:28

6 Answers

133

If by control characters you mean the first 32 ASCII characters and \x7F (that includes the carriage return, etc.!), then this will work:

preg_replace('/[\x00-\x1F\x7F]/', '', $input);

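(Note the single quotes: inside double quotes PHP itself expands \x00 into a NUL byte before the pattern reaches the regex engine, and older PHP versions reject a pattern containing a NUL byte.)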

The line feed and carriage return (often written \n and \r) can be preserved like so:

preg_replace('/[\x00-\x09\x0B\x0C\x0E-\x1F\x7F]/', '', $input);
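The two patterns above can be compared side by side. This is only a minimal sketch: the function names are hypothetical, and the input is assumed to be a single-byte or UTF-8 string.

```php
<?php
// Hypothetical helper names; the patterns are the two shown above.

/** Remove every ASCII control character (0x00-0x1F and 0x7F). */
function strip_ascii_controls(string $input): string
{
    return preg_replace('/[\x00-\x1F\x7F]/', '', $input);
}

/** Same, but keep line feed (0x0A) and carriage return (0x0D). */
function strip_ascii_controls_keep_newlines(string $input): string
{
    return preg_replace('/[\x00-\x09\x0B\x0C\x0E-\x1F\x7F]/', '', $input);
}

$raw = "\x02Hello\tWorld\r\n\x7F"; // STX, tab, CR, LF, DEL

var_dump(strip_ascii_controls($raw));               // tab, CR and LF are stripped too
var_dump(strip_ascii_controls_keep_newlines($raw)); // only CR and LF survive
```

Note that the second pattern still removes the tab (\x09). To keep tabs as well, start the first range at \x00-\x08 instead.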

I must say that I think Bobby's answer is better, in the sense that [:cntrl:] better conveys what the code does than [\x00-\x1F\x7F].

WARNING: ereg_replace is deprecated as of PHP 5.3.0 and removed as of PHP 7.0.0. Use preg_replace instead:

preg_replace('/[[:cntrl:]]/', '', $input);
Stephan202
  • 7
    sadly ereg_replace is deprecated in PHP 5.3 and the mb version is slower than preg_replace. There is a slightly cleaner way to do this with preg_replace, and in my testing it is very slightly faster (1% faster when dealing with hundreds of thousands of items) than the one above: preg_replace('/[\p{Cc}]/', '', $input); – Jay Paroline Jun 18 '10 at 21:53
  • 9
    Additionally, preg_replace('/[[:cntrl:]]/', '', $input); worked for me just fine (php 5.2.6). – ford Dec 03 '10 at 23:22
  • not working for me, this string >>"Rua Enette Dubard, 806 - Loja 2" is converted to this >> "Rua Eee Dubad, 806 - Loja 2" and carriage return char is still there. – ruhalde Apr 05 '12 at 04:24
  • 2
    Note that you may also want to save tabs "\t". I found this question because I was getting \x1D in my database. – jcampbell1 Sep 24 '12 at 20:06
  • Check this for why preg_replace('/[[:cntrl:]]/', '', $input); works: http://stackoverflow.com/questions/475159/php-regex-what-is-class-at-offset-0 – David Oct 15 '13 at 17:25
  • For sanitising console input the first preg_replace worked but not the second (which I thought was just an extension of the first) – myol Jun 11 '15 at 11:45
49

For Unicode input, this will remove all control characters, unassigned, private use, formatting and surrogate code points (that are not also space characters, such as tab, new line) from your input text. I use this to remove all non-printable characters from my input.

<?php
$clean = preg_replace('/[^\PC\s]/u', '', $input);

for more info on \p{C} see http://www.regular-expressions.info/unicode.html#category
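The double negative in that pattern is easier to follow with a concrete string. This is a minimal sketch, assuming valid UTF-8 input and PHP 7 or later (for the "\u{...}" escapes); the function name is hypothetical.

```php
<?php
// Hypothetical wrapper around the pattern above; assumes valid UTF-8 input
// (preg_replace returns null on malformed UTF-8 when /u is used).
function strip_non_printable(string $input): string
{
    // \PC      = any code point NOT in Unicode category C ("other")
    // \s       = whitespace
    // [^\PC\s] = anything that is neither of those, i.e. a category-C
    //            code point that is not whitespace
    return preg_replace('/[^\PC\s]/u', '', $input);
}

// U+0002 (STX, category Cc) and U+200B (zero width space, category Cf)
// are removed; the tab and the newline count as whitespace and are kept.
$sample = "a\u{0002}b\u{200B}c\td\n";
var_dump(strip_non_printable($sample) === "abc\td\n");
```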

JFK
Scott Jungwirth
  • Why do you use `\PC` instead of `\p{C}`? – syl.fabre Nov 08 '16 at 10:15
  • We have to use a negated character class to avoid removing spaces (since they are considered invisible), which means we need to use the inverse form of `\p{C}` – Scott Jungwirth Nov 08 '16 at 14:26
  • 1
    This is exactly what you need when sending user input to the Authorize.net API. In case anyone else is having invalid XML character errors. – Nostalg.io Jan 10 '17 at 21:54
  • In stupid people terms (yeah, that's me) can someone kindly explain how this works. It does work, I know I've implemented it with extensive unit test coverage, however when I read it back with my current understanding it doesn't make sense. The way I understand it is that it looks like it should replace anything that's not a control character or a white space with nothing? i.e. you'd end up with only the control characters and white space remaining...? Thanks in advance! – Chris Rosillo Jul 06 '17 at 08:41
  • 2
    Hi @ChrisRosillo, we use the inverse form of `\p{C}` which is `\PC`, so where `\p{C}` matches control characters, \PC matches everything that isn't a control character. Then we use a negated character class `[^..]` to say match/replace anything "not [ not a control character or space ]". So it is kind of a double negative. – Scott Jungwirth Jul 14 '17 at 19:17
  • 1
    @syl.fabre about the brackets: "If only one letter is specified with \p or \P, it includes all the properties that start with that letter. In this case, in the absence of negation, the curly brackets in the escape sequence are optional" – pmiguelpinto90 Dec 04 '20 at 12:13
  • You are a godsend. Having no idea about regex, this was the only one that worked in removing an obscure newline character that I've searched for in ages. – Gokigooooks Sep 04 '22 at 23:21
24

PHP does support POSIX-Classes so you can use [:cntrl:] instead of some fancy character-magic-stuff:

ereg_replace("[:cntrl:]", "", $pString);

Edit:

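An extra pair of square brackets is needed (reported for PHP 5.3): without it, [:cntrl:] is read as an ordinary character class containing the literal characters :, c, n, t, r and l.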

ereg_replace("[[:cntrl:]]", "", $pString);
Bobby
  • 1
    PHP does support POSIX, using the ereg functions instead of preg: http://nl2.php.net/manual/en/book.regex.php – Duroth Sep 30 '09 at 12:57
  • In my testing, this only worked when adding an extra square bracket to the statement like so: ereg_replace("[[:cntrl:]]", "", $pString); PHP 5.3.5. – dereferenced Nov 01 '11 at 14:31
  • 2
    As `ereg_replace` is removed in PHP 7.0, for PHP > 7.0 it should be: `preg_replace("/[[:cntrl:]]/", "", $input);` – wowpatrick Jul 13 '21 at 09:08
11

TLDR Answer

Use this Regex...

/[\p{Cc}\p{Cn}\p{Cs}]/u

Like this...

$text = preg_replace('/[\p{Cc}\p{Cn}\p{Cs}]/u', '', $text);

TLDR Explanation

  • \p{Cc} : Match control characters (this includes tab, line feed and carriage return).
  • \p{Cn} : Match unassigned code points.
  • \p{Cs} : Match surrogate code points, which are not valid in UTF-8.

Everything the character class matches is replaced with the empty string, i.e. removed.

Working Demo

Simple demo to demonstrate: IDEOne Demo

$text = "\u{0019}hello";
print($text . "\n\n");
$text = preg_replace('/[\p{Cc}\p{Cn}\p{Cs}]/u', '', $text);
print($text);

Output:

(-Broken-Character)hello
hello

Alternatives

  • \p{C} : Remove every invisible "other" character: control, format, private-use, surrogate and unassigned.
  • \p{Cc} : Remove control characters only.
  • [\p{Cc}\p{Cn}] : Remove control characters and unassigned code points.
  • [\p{Cc}\p{Cn}\p{Cs}] : Remove control characters, unassigned code points and surrogates.
  • [\p{Cc}\p{Cn}\p{Cs}\p{Cf}] : Remove control characters, unassigned code points, surrogates and formatting characters.

Source and Explanation

Take a look at the Unicode Character Properties available that can be used to test within a regex. You should be able to use these regexes in Microsoft .NET, JavaScript, Python, Java, PHP, Ruby, Perl, Golang, and even Adobe. Knowing Unicode character classes is very transferable knowledge, so I recommend using it!

This regex will match anything visible, given in both its short-hand and long-hand form...

[\p{L}\p{M}\p{N}\p{P}\p{S}\p{Z}]
[\p{Letter}\p{Mark}\p{Number}\p{Punctuation}\p{Symbol}\p{Separator}]

The long names are not accepted by every engine, so the short names are the safer choice in PHP.

\p{...} matches characters that have the property, and \P{...} (capital P) matches characters that do not. PHP's PCRE engine supports both forms. Be careful when dropping the braces: only a single letter may follow, so \PCc is read as \P{C} followed by a literal c, not as "not in category Cc".

A simpler regex then would be \p{C}, but this might be too aggressive, since it also deletes invisible formatting characters. You may want to look closely and see what's best, but one of the alternatives should fit your needs.
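To see how the choice of subcategory changes the result, here is a minimal sketch that compares a narrow class against the whole \p{C} category. The sample string is made up for illustration, and it assumes PHP 7 or later (for the "\u{...}" escapes) and valid UTF-8 input.

```php
<?php
// U+0019 is a control character (Cc); U+200B ZERO WIDTH SPACE is a
// format character (Cf); U+E000 is a private-use code point (Co).
$text = "a\u{0019}b\u{200B}c\u{E000}d";

// Control, unassigned and surrogate code points only: Cf and Co survive.
$narrow = preg_replace('/[\p{Cc}\p{Cn}\p{Cs}]/u', '', $text);

// The whole "other" category: Cf and Co are removed as well.
$broad = preg_replace('/\p{C}/u', '', $text);

var_dump($narrow === "ab\u{200B}c\u{E000}d");
var_dump($broad === "abcd");
```

The narrow class leaves the zero width space in place, which matters for scripts that rely on format characters such as joiners. The broad \p{C} is simpler when the goal is plain visible text.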

All Matchable Unicode Character Sets

If you want to know any other character sets available, check out regular-expressions.info...

  • \p{L} or \p{Letter}: any kind of letter from any language.
    • \p{Ll} or \p{Lowercase_Letter}: a lowercase letter that has an uppercase variant.
    • \p{Lu} or \p{Uppercase_Letter}: an uppercase letter that has a lowercase variant.
    • \p{Lt} or \p{Titlecase_Letter}: a letter that appears at the start of a word when only the first letter of the word is capitalized.
    • \p{L&} or \p{Cased_Letter}: a letter that exists in lowercase and uppercase variants (combination of Ll, Lu and Lt).
    • \p{Lm} or \p{Modifier_Letter}: a special character that is used like a letter.
    • \p{Lo} or \p{Other_Letter}: a letter or ideograph that does not have lowercase and uppercase variants.
  • \p{M} or \p{Mark}: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).
    • \p{Mn} or \p{Non_Spacing_Mark}: a character intended to be combined with another character without taking up extra space (e.g. accents, umlauts, etc.).
    • \p{Mc} or \p{Spacing_Combining_Mark}: a character intended to be combined with another character that takes up extra space (vowel signs in many Eastern languages).
    • \p{Me} or \p{Enclosing_Mark}: a character that encloses the character it is combined with (circle, square, keycap, etc.).
  • \p{Z} or \p{Separator}: any kind of whitespace or invisible separator.
    • \p{Zs} or \p{Space_Separator}: a whitespace character that is invisible, but does take up space.
    • \p{Zl} or \p{Line_Separator}: line separator character U+2028.
    • \p{Zp} or \p{Paragraph_Separator}: paragraph separator character U+2029.
  • \p{S} or \p{Symbol}: math symbols, currency signs, dingbats, box-drawing characters, etc.
    • \p{Sm} or \p{Math_Symbol}: any mathematical symbol.
    • \p{Sc} or \p{Currency_Symbol}: any currency sign.
    • \p{Sk} or \p{Modifier_Symbol}: a combining character (mark) as a full character on its own.
    • \p{So} or \p{Other_Symbol}: various symbols that are not math symbols, currency signs, or combining characters.
  • \p{N} or \p{Number}: any kind of numeric character in any script.
    • \p{Nd} or \p{Decimal_Digit_Number}: a digit zero through nine in any script except ideographic scripts.
    • \p{Nl} or \p{Letter_Number}: a number that looks like a letter, such as a Roman numeral.
    • \p{No} or \p{Other_Number}: a superscript or subscript digit, or a number that is not a digit 0–9 (excluding numbers from ideographic scripts).
  • \p{P} or \p{Punctuation}: any kind of punctuation character.
    • \p{Pd} or \p{Dash_Punctuation}: any kind of hyphen or dash.
    • \p{Ps} or \p{Open_Punctuation}: any kind of opening bracket.
    • \p{Pe} or \p{Close_Punctuation}: any kind of closing bracket.
    • \p{Pi} or \p{Initial_Punctuation}: any kind of opening quote.
    • \p{Pf} or \p{Final_Punctuation}: any kind of closing quote.
    • \p{Pc} or \p{Connector_Punctuation}: a punctuation character such as an underscore that connects words.
    • \p{Po} or \p{Other_Punctuation}: any kind of punctuation character that is not a dash, bracket, quote or connector.
  • \p{C} or \p{Other}: invisible control characters and unused code points.
    • \p{Cc} or \p{Control}: an ASCII or Latin-1 control character: 0x00–0x1F and 0x7F–0x9F.
    • \p{Cf} or \p{Format}: invisible formatting indicator.
    • \p{Co} or \p{Private_Use}: any code point reserved for private use.
    • \p{Cs} or \p{Surrogate}: one half of a surrogate pair in UTF-16 encoding.
    • \p{Cn} or \p{Unassigned}: any code point to which no character has been assigned.
HoldOffHunger
6

To keep the control characters but make them compatible with JSON, I had to do

$str = preg_replace(
    array(
        '/\x00/', '/\x01/', '/\x02/', '/\x03/', '/\x04/',
        '/\x05/', '/\x06/', '/\x07/', '/\x08/', '/\x09/', '/\x0A/',
        '/\x0B/','/\x0C/','/\x0D/', '/\x0E/', '/\x0F/', '/\x10/', '/\x11/',
        '/\x12/','/\x13/','/\x14/','/\x15/', '/\x16/', '/\x17/', '/\x18/',
        '/\x19/','/\x1A/','/\x1B/','/\x1C/','/\x1D/', '/\x1E/', '/\x1F/'
    ), 
    array(
        "\u0000", "\u0001", "\u0002", "\u0003", "\u0004",
        "\u0005", "\u0006", "\u0007", "\u0008", "\u0009", "\u000A",
        "\u000B", "\u000C", "\u000D", "\u000E", "\u000F", "\u0010", "\u0011",
        "\u0012", "\u0013", "\u0014", "\u0015", "\u0016", "\u0017", "\u0018",
        "\u0019", "\u001A", "\u001B", "\u001C", "\u001D", "\u001E", "\u001F"
    ), 
    $str
);

(The JSON rules state: “All Unicode characters may be placed within the quotation marks except for the characters that must be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F).”)
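The same mapping can be written more compactly with a callback instead of two 32-entry lookup tables. This is only a sketch under that assumption: the function name is hypothetical, and it produces the same \uXXXX escapes as the arrays above.

```php
<?php
// Hypothetical helper: rewrite each control character U+0000-U+001F as a
// JSON \uXXXX escape, giving the same result as the lookup tables above.
function escape_controls_for_json(string $str): string
{
    return preg_replace_callback(
        '/[\x00-\x1F]/',
        function (array $m): string {
            // ord() gives the byte value; %04X pads it to four hex digits.
            return sprintf('\\u%04X', ord($m[0]));
        },
        $str
    );
}

echo escape_controls_for_json("a\x02b\n"), PHP_EOL; // prints a\u0002b\u000A
```

If the whole value is being serialized anyway, PHP's json_encode() already escapes control characters, so a helper like this is only needed when assembling JSON text by hand.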

Nikola Petkanski
Jamie
1

regex free method

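If you are only zapping the control characters I'm familiar with (those below 32, plus 127), try this out:

// Remove each of the control characters 0x00 through 0x1F in turn.
for ($control = 0; $control < 32; $control++) {
    $pString = str_replace(chr($control), "", $pString);
}

// DEL (0x7F) sits outside that range, so remove it separately.
$pString = str_replace(chr(127), "", $pString);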

The loop gets rid of all but DEL, which we just add to the end.

I'm thinking this will be a lot less stressful on you and the script than dealing with regex and the regex library.

Updated regex free method

Just for kicks, I came up with another way to do it. This one does it using an array of control characters:

$ctrls = range(chr(0), chr(31));
$ctrls[] = chr(127);

$clean_string = str_replace($ctrls, "", $string);
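A variant of the array approach, wrapped in a function for reuse. This is a sketch under stated assumptions: the function name is hypothetical, and the list is built from integer code points with array_map so that it does not depend on how range() treats one-character string arguments.

```php
<?php
// Hypothetical helper; works byte-wise, which is safe for UTF-8 because
// bytes below 0x80 never occur inside a multi-byte sequence.
function strip_controls_without_regex(string $input): string
{
    $ctrls = array_map('chr', range(0, 31)); // bytes 0x00 through 0x1F
    $ctrls[] = chr(127);                     // DEL
    return str_replace($ctrls, '', $input);
}

var_dump(strip_controls_without_regex("a\x00b\x1Fc\x7Fd") === "abcd");
```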
Anthony
  • 1
    How is this less "stressful" than ereg_replace("[:cntrl:]", "", $pString); ? Using ereg, the PHP interpreter will probably compile more efficient intermediate code than it would using that for loop anyway. – glomad Sep 30 '09 at 15:08
  • 6
    ereg_replace is deprecated from php 5.3.0 – Wiliam Sep 05 '12 at 13:13