Possible Duplicate:
PHP: replace invalid characters in utf-8 string in

I have a string that contains an invalid character (it's not UTF-8), which is displayed as SUB.

I think it's some kind of foreign invalid character.

Is there a way in PHP to take a string and use preg_replace or something else to ensure that I am only using valid UTF-8 characters in my strings, and anything else just gets removed?

Thanks.

Ethan Allen

2 Answers

First of all, there are no invalid UTF-8 characters. There are invalid UTF-8 bytes and byte sequences, which usually mean that the input is corrupted or mislabeled, or that someone is attempting an encoding attack on your server. You can validate incoming data with mb_check_encoding and fail immediately with 400 Bad Request if it is not valid UTF-8.
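
A minimal sketch of that validation step, under a few assumptions: the input arrives in `$_POST['name']` (a made-up field name), and `statusForInput` is a hypothetical helper, not part of PHP or mbstring. It only illustrates the "check, then reject" idea.

```php
<?php
// Hypothetical helper: maps a raw input string to the HTTP status
// a handler might send. 400 means the bytes are not valid UTF-8.
function statusForInput(string $raw): int
{
    return mb_check_encoding($raw, 'UTF-8') ? 200 : 400;
}

// Assumed field name, for illustration only.
$name = $_POST['name'] ?? '';

if (!is_string($name) || statusForInput($name) === 400) {
    http_response_code(400);
    exit('Bad Request: input is not valid UTF-8');
}
```

Note that a string containing only the SUB character (`"\x1A"`) passes this check, because SUB is a valid character.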

What you have is just the SUBSTITUTE control character (U+001A), which is a valid character but unprintable.

Originally intended for use as a transmission control character to indicate that garbled or invalid characters had been received. It has often been put to use for other purposes when the in-band signaling of errors it provides is unneeded, especially where robust methods of error detection and correction are used, or where errors are expected to be rare enough to make using the character for other purposes advisable.

You can use this regex to get rid of it (and a few others):

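// Match any control character (\p{Cc}) except CR, LF, and tab.
$reg = '/(?![\r\n\t])[\p{Cc}]/u';

// preg_replace returns the cleaned string; it does not modify $str in place.
$str = preg_replace( $reg, "", $str );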
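
To see the pattern at work, here is a small worked example. The sample string and the variable names `$dirty` and `$clean` are made up for illustration; the regex is the one from this answer.

```php
<?php
// Same pattern as in the answer: control characters except CR, LF, tab.
$reg = '/(?![\r\n\t])[\p{Cc}]/u';

// Made-up input: SUB (\x1A) and NUL (\x00) mixed with text and a newline.
$dirty = "Hello\x1A World\x00!\n";

$clean = preg_replace($reg, '', $dirty);

// With the /u flag, preg_replace returns null when the subject is not
// valid UTF-8, so check for that before using the result.
if ($clean === null) {
    $clean = '';
}

var_dump($clean === "Hello World!\n"); // bool(true)
```

The newline survives because of the negative lookahead. This only strips valid but unprintable characters; it does not repair invalid byte sequences.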
Esailija
  • UTF-8 characters vs UTF-8 bytes? Can you elaborate? – Salman A Jan 11 '13 at 13:14
  • Characters = `ABCΩ`. UTF-8 bytes for those: `0x41 0x42 0x43 0xce 0xa9`. The term `UTF-8 character` does not make sense: UTF-8 is an encoding, something that describes how characters are represented as concrete bytes. – Esailija Jan 11 '13 at 13:31
  • I am wondering if you meant _there are invalid UTF-8 byte *sequences*_. – Salman A Jan 11 '13 at 13:40
  • @SalmanA true, I have now added it. There are also plenty of invalid bytes, bytes that are never valid, not even in any sequence. – Esailija Jan 11 '13 at 13:41
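
The character/byte distinction from the comments above can be checked directly. This is a sketch assuming PHP 7.0 or later (for the `"\u{...}"` escape) and the mbstring extension; the variable names are illustrative.

```php
<?php
// "ABCΩ" written with an escape so the example does not depend on
// the encoding of the source file.
$text = "ABC\u{3A9}";

$hex   = bin2hex($text);            // "414243cea9"
$chars = mb_strlen($text, 'UTF-8'); // 4 characters
$bytes = strlen($text);             // 5 bytes

// A complete sequence is valid; a lone lead byte is not.
$omegaValid = mb_check_encoding("\xCE\xA9", 'UTF-8'); // true
$leadOnly   = mb_check_encoding("\xCE", 'UTF-8');     // false

// 0xFF is never valid in UTF-8, in any position.
$neverValid = mb_check_encoding("\xFF", 'UTF-8');     // false
```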

The mb_check_encoding function should be able to do this.

// Returns true if the string is valid UTF-8, false otherwise.
$isValid = mb_check_encoding("Jetzt gibts mehr Kanonen", "UTF-8");

Note: I haven't tested this.

Powerlord
  • This just returns true or false; it doesn't tell me which character is invalid, and it doesn't remove it. Close though! – Ethan Allen Jan 10 '13 at 22:16
  • If you need to check which characters are invalid, you could iterate over your string and call `mb_check_encoding` on each character and remove them if it returns false. – Kyle Jan 10 '13 at 22:39
  • @Kyle you mean each byte, and then it would return false for every byte of a multi-byte sequence; in UTF-8 only the ASCII bytes would pass. – Esailija Jan 11 '13 at 11:47
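
Since checking byte by byte does not work, one alternative for actually dropping invalid sequences is to let mbstring re-encode the string with substitution turned off. This is a sketch, not something proposed in the answers above: `stripInvalidUtf8` is a hypothetical helper name, and exactly how many bytes of a malformed sequence get dropped may vary between PHP versions.

```php
<?php
// Hypothetical helper: removes bytes that do not form valid UTF-8.
function stripInvalidUtf8(string $s): string
{
    // Remember the current substitute setting so it can be restored.
    $previous = mb_substitute_character();

    // 'none' tells mbstring to drop invalid input instead of
    // replacing it with a placeholder such as '?'.
    mb_substitute_character('none');
    $clean = mb_convert_encoding($s, 'UTF-8', 'UTF-8');

    mb_substitute_character($previous);
    return $clean;
}

var_dump(stripInvalidUtf8("a\xFFb") === "ab"); // bool(true)
```

A SUB character passes through unchanged, because it is valid UTF-8; removing it still needs the `\p{Cc}` regex from the first answer.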