2

I have a lot of strings in our MySQL database that contain control characters such as ^M. I want a regex that removes them in PHP, but leaves alone things such as newlines, e.g. "\n".

I've tried the following:

preg_replace('/[\x00-\x09\x0B\x0C\x0E-\x1F\x7F]/', '', $bad);

This seems to leave it in place.

What's the best way to get rid of these control characters?

randombits
  • have an example of such string? – RomanPerekhrest Feb 02 '17 at 22:23
  • You could perhaps approach this from the DB end: http://dba.stackexchange.com/questions/97518/how-to-identify-and-remove-the-sql-table-row-which-contains-utf-8-bom-characters – scrowler Feb 02 '17 at 22:23
  • @RomanPerekhrest I'm not entirely sure how to enter the control character ^M. In vim you can do ctrl+v+m, but can't just copy/paste that output here. Suggestions? – randombits Feb 02 '17 at 22:25
  • Of course this regex leaves newlines (0x0A) behind, they aren't matched. If you want to know what the regex should be then you'll need to be a lot more specific than "control characters" and what character set you are using. – symcbean Feb 02 '17 at 22:28
  • Ideally I think everything from dec 0-31 should be matched, and leave "\r\n" alone. http://www.asciitable.com/ – randombits Feb 02 '17 at 22:32
  • Use `preg_replace("/(?![\r\n])[[:cntrl:]]/", "", $bad);` – Wiktor Stribiżew Feb 02 '17 at 22:56
  • Been over a year since this question was asked, but the ^M is a carriage return (\r, 0x0D). Windows uses \r\n for line endings, and classic Mac OS (before OS X) used a lone \r; modern macOS and Linux use \n. So if you have ^M in your database, the text most likely came from a Windows client or a very old Mac. – Daniel Rudy Jul 29 '18 at 05:38

2 Answers

6

I want a regex that removes it in PHP, but leaves alone things such as new lines, eg: "\n"

Use the following approach:

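$clean = preg_replace('/(\x0A)|[[:cntrl:]]/', '$1', $bad);

\x0A - matches a newline (line feed) character

[[:cntrl:]] - the POSIX class that matches any control character

(\x0A)|[[:cntrl:]] - an alternation that matches either a newline, which is captured in group 1, or any other single control character. The newline branch is tried first, so a newline never reaches the [[:cntrl:]] branch.

$1 - the contents of the first capturing group. It holds the newline when that branch matched and is empty otherwise, so newlines are written back and every other control character (including ^M, the carriage return) is replaced with nothing.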
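To make the behaviour concrete, here is a minimal sketch of the capture-group replacement applied to a made-up string. The sample value and the `$clean` variable are assumptions for illustration only. Note that this pattern also strips tabs and carriage returns, since both are control characters.

```php
<?php
// Hypothetical sample: "line one", CR LF, "line two", BEL, TAB, "end".
$bad = "line one\r\nline two\x07\tend";

// Group 1 captures a newline. Any other control character matches the
// second branch, leaving group 1 empty, so it is replaced with nothing.
$clean = preg_replace('/(\x0A)|[[:cntrl:]]/', '$1', $bad);

var_dump($clean === "line one\nline twoend"); // bool(true)
```

If tabs should survive as well, one option is to widen the captured branch to `([\x09\x0A])`.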

RomanPerekhrest
1

You can use this replacement:

$result = preg_replace('~[^\P{Cc}\r\n]+~u', '', $str);

\p{Cc} is the Unicode character category for control characters. \P{Cc} is the opposite (everything that is not a control character).

[^\P{Cc}\r\n] matches any character that is neither a non-control character, nor \r, nor \n. In other words, it matches every control character except \r and \n.

The u modifier ensures that the string and the pattern are read as UTF-8 strings.

If you want to preserve another control character, for example the TAB, add it to the negated character class: [^\P{Cc}\r\n\t]
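As a small illustration of how the contents of the negated class decide what survives, the sketch below runs two variants of the pattern over a made-up string. The sample value and the variable names `$keepCrLf` and `$keepLf` are assumptions, not part of the original answer. The first variant keeps the carriage return (^M), the second removes it.

```php
<?php
// Hypothetical sample: CR LF line ending, a NUL byte, and an accented letter.
$str = "caf\u{00E9}\r\nbar\x00baz";

// Keeps \r and \n: only the NUL byte is removed.
$keepCrLf = preg_replace('~[^\P{Cc}\r\n]+~u', '', $str);

// Keeps only \n: the carriage return (^M) is removed as well.
$keepLf = preg_replace('~[^\P{Cc}\n]+~u', '', $str);

var_dump($keepCrLf === "caf\u{00E9}\r\nbarbaz"); // bool(true)
var_dump($keepLf === "caf\u{00E9}\nbarbaz");     // bool(true)
```

One caveat with the u modifier: if the subject is not valid UTF-8, preg_replace returns null, so it may be worth checking the return value when the data comes from a database with mixed encodings.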

Casimir et Hippolyte
  • For whatever reason this one doesn't work for me. The answer by @RomanPerekhrest below does. I'm trying to understand why your example doesn't work on a string that most certainly has `^M` characters in it. – randombits Feb 02 '17 at 23:09
  • @randombits: I think that `^M` stands for a carriage return (CR), if you don't want carriage returns in your string, remove `\r` from the negated character class. – Casimir et Hippolyte Feb 02 '17 at 23:14
  • @randombits: note that if your goal is only to convert Windows line endings to Unix/Linux line endings, `str_replace("\r\n", "\n", $str)` or `str_replace("\r", "", $str)` should suffice. – Casimir et Hippolyte Feb 02 '17 at 23:55
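A minimal sketch of that line-ending normalization follows, using a made-up sample string. It is a slight variation on the calls in the comment above: it assumes lone carriage returns should become newlines rather than be deleted, which may or may not suit your data.

```php
<?php
// Hypothetical sample mixing Windows (\r\n) and lone-CR line endings.
$str = "one\r\ntwo\rthree\n";

// str_replace applies the search values in order: \r\n first,
// then any remaining lone \r, each replaced with \n.
$normalized = str_replace(["\r\n", "\r"], "\n", $str);

var_dump($normalized === "one\ntwo\nthree\n"); // bool(true)
```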