How can I replace invalid characters in a UTF-8 string with whitespace characters (using a regex in PHP 5)?
8
- What do you want to do? Get rid of whitespace, or of UTF-8 characters? Please give an example. – Quinn Wilson Sep 16 '09 at 15:19
- Getting rid of UTF-8 characters is easy: `$text = '';` :-) – Joey Sep 16 '09 at 15:26
4 Answers
23

RageZ
- This didn't work for me: the invalid characters stayed, just as bobef reported. It simply doesn't do the job. – Rodniko Apr 10 '13 at 14:06
- This worked for me. The source file was a downloaded CSV of SBA franchise codes, which I manually reformatted as JSON for use in a Laravel seeder. Even though the formatted file passed JSON validation, it still contained hidden, invalid UTF-8 characters that PHP couldn't decode. – Ixalmida Jun 12 '17 at 21:01
- I haven't debugged this in detail yet, but neither iconv nor mb_convert_encoding solves the issue with json_encode(). They might help in many cases, but not in all. – John Jan 13 '18 at 07:21
8
With mbstring you can do:
$text = mb_convert_encoding($text, 'UTF-8', 'UTF-8');
This works as you want (it replaces invalid characters with whitespace), but it doesn't seem to work if you want to substitute invalid characters with something else, such as ?.
See: Replacing invalid UTF-8 characters by question marks, mbstring.substitute_character seems ignored

Community

Maxime Pacary
3
If you have come across the cursed 'Invalid Character' error while using PHP's XML or JSON parser, then you may be interested in this.
Unfortunately, PHP's XML and JSON parsers do not ignore invalid UTF-8 bytes; instead they stop and raise a rather unhelpful error. I found the code below on the net, and it works very well for me.
// Replace with ?: control characters other than tab, LF and CR; an ASCII byte
// followed by stray continuation bytes; overlong 2-byte lead bytes (C0, C1);
// 4-byte sequences (characters at or above U+10000); and 2- or 3-byte
// sequences with the wrong number of continuation bytes.
$some_string = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]'.
    '|[\x00-\x7F][\x80-\xBF]+'.
    '|([\xC0\xC1]|[\xF0-\xFF])[\x80-\xBF]*'.
    '|[\xC2-\xDF]((?![\x80-\xBF])|[\x80-\xBF]{2,})'.
    '|[\xE0-\xEF](([\x80-\xBF](?![\x80-\xBF]))|(?![\x80-\xBF]{2})|[\x80-\xBF]{3,})/S',
    '?', $some_string);
// Replace overlong 3-byte sequences and UTF-16 surrogates with ?
$some_string = preg_replace('/\xE0[\x80-\x9F][\x80-\xBF]'.
    '|\xED[\xA0-\xBF][\x80-\xBF]/S', '?', $some_string);
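
The two calls above substitute ?, whereas the question asks for whitespace, and they also reject every 4-byte sequence. As a rough sketch (not taken from any of the answers), a single pass could keep each well-formed UTF-8 sequence and swap any other byte for a space. The function name `replace_invalid_utf8` and its `$replacement` parameter are hypothetical, and the closure assumes PHP 5.3 or later:

```php
<?php
// Hypothetical helper (an assumption, not part of the answers above): keeps every
// well-formed UTF-8 sequence and replaces each remaining byte with $replacement.
function replace_invalid_utf8($text, $replacement = ' ')
{
    // Group 1 matches one well-formed UTF-8 character (1 to 4 bytes).
    // Group 2 matches any single byte that group 1 could not consume.
    $pattern = '/
        (
            [\x00-\x7F]                          # ASCII
          | [\xC2-\xDF][\x80-\xBF]               # 2 bytes, overlong C0 and C1 excluded
          | \xE0[\xA0-\xBF][\x80-\xBF]           # 3 bytes, overlong forms excluded
          | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}    # 3 bytes, general case
          | \xED[\x80-\x9F][\x80-\xBF]           # 3 bytes, surrogates excluded
          | \xF0[\x90-\xBF][\x80-\xBF]{2}        # 4 bytes, overlong forms excluded
          | [\xF1-\xF3][\x80-\xBF]{3}            # 4 bytes, general case
          | \xF4[\x80-\x8F][\x80-\xBF]{2}        # 4 bytes, up to U+10FFFF
        )
      | (.)                                      # anything else is an invalid byte
    /xs';

    return preg_replace_callback($pattern, function ($m) use ($replacement) {
        // $m[1] is non-empty only when a well-formed character was matched.
        return $m[1] !== '' ? $m[1] : $replacement;
    }, $text);
}
```

Called as `replace_invalid_utf8("a\xFFb")`, this should return `a b`. Each invalid byte is replaced individually, so a truncated 3-byte sequence yields two spaces. Control characters are left alone here, since they are valid UTF-8; strip them separately if the XML or JSON parser rejects them.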

George John
- This doesn't solve the issue with json_encode(). It also reports some valid UTF-8 as invalid, sadly without giving any clue as to what the problem is. – John Jan 13 '18 at 07:32
3
iconv did not work in my case (nor did the other solutions), so I found my own solution here, in the section on "Character validation":

bobef