7

I cannot solve this issue and I'm getting crazy.

JSON_encode() is casting the error: Malformed UTF-8 characters, possibly incorrectly encoded on few records (2 or 3) from a set of 10k records. However this seems very impossible to fix.

  • mysql is already utf8mb4 everywhere (database, table, columns and collation)
  • php is 7.2 and of course in utf8
  • apache default charset is utf8 (however the error is throw at PHP-level).

I can also print to screen correctly the record in PHP without issue in a simple HTML debug page. However If I try to encode it in JSON I get the error.

I found that these records have been imported from a CVS probably bypassing the cleaner. What is so strange is that the entire CSV file is parsed with:

$this->encoding = mb_detect_encoding($source,mb_detect_order(),true);
if ($this->encoding!="" && $this->encoding!="UTF8") {
    $source = iconv($this->encoding, "UTF-8", $source);
} 

I cannot post any full broken data due to the privacy (and GDPR). However I succeed to extract a part which seems to be the broken one:

RESIDENCE �PRINCIPE

UPDATES

I try to get the bitcode of these broken chars. This is what I found. In ASCII by using simple native function str_split and ord these char is:

'�' 160

I would like to find the bitcode also in utf8, so I find this usefull function on PHP.net http://php.net/manual/en/function.ord.php#109812 Which try to find bitcode of MultiByteStrings. and it gives me:

-2096

Which is....... negative?

Max Cuttins
  • 614
  • 2
  • 6
  • 20
  • 2
    Were you going to show us an example of the "broken" data? – Ignacio Vazquez-Abrams May 30 '18 at 18:10
  • My unique example of broken data are wirebank transfers records, i cannot post these details on stackoverflown. Sorry (GDPR you know). – Max Cuttins May 31 '18 at 08:07
  • Can anybody let me know why people should downvote a question like this? Because really it seems to be just for badness. – Max Cuttins May 31 '18 at 08:09
  • You're getting the data from the database? Have you set the ***database connection encoding*** to `utf8mb4`? Note: not the column collation/charset, the *connection encoding*. – deceze May 31 '18 at 08:19
  • As I wrote: mysql is already utf8mb4 everywhere (database, table, columns and collation) – Max Cuttins May 31 '18 at 08:28
  • Again, I'm not talking about the database, table, columns or collations. I'm talking about the *connection encoding*. Common oversight. See https://stackoverflow.com/a/279279/476. – deceze May 31 '18 at 08:33
  • Ok sorry... you have right, connection encoding is how PHP open the mysqli connection. Right, and yes it is in UTF8 (sorry this is hardcoded in the connection function since ages and I'm so used to have it in UTF8 that I miss to write this down in the post). But yes of course is in UTF8. $db->set_charset("utf8"); – Max Cuttins May 31 '18 at 08:38
  • OK, then what is the binary value of that string? `echo bin2hex($str)` – deceze May 31 '18 at 08:40
  • I found bitcode of the broken char some minutes ago and I wrote in the question (check my edit above). I found that the broken char is `160` in ASCII. I found that this is the `NBSP` char in `windows-1252 charset` which is probably the original charset of the data. What I can't understand is how this char find to bypass the filter: `$this->encoding = mb_detect_encoding($source,mb_detect_order(),true); if ($this->encoding!="" && $this->encoding!="UTF8") { $source = iconv($this->encoding, "UTF-8", $source); } ` – Max Cuttins May 31 '18 at 08:43
  • "Bitcode `160`" is… not really a thing. Use `bin2hex`, that's simple, clear, and straight forward. – deceze May 31 '18 at 08:55
  • What is 160 depend by charset. In Windows-1252 is NBSP: https://en.wikipedia.org/wiki/Windows-1252 – Max Cuttins Jun 01 '18 at 09:51

2 Answers2

6

SOLVED!

The issue was in the function mb_detect_order(), this function just don't work as I was expecting. I was thinking this was a list of full supporting encoding order by mostly used in order to speed up the detection process.

But I just found that this function return just 2 encoding:

//print_r(mb_detect_order());
Array
(
    [0] => ASCII
    [1] => UTF-8
)

Which is almost completly useless in my case. MB functions can detect much more charset. You can check them out by run mb_list_encodings() and get the full list:

//print_r(mb_list_encodings());
Array
(
    [0] => pass
    [1] => auto
    [2] => wchar
    [3] => byte2be
    [4] => byte2le
    [5] => byte4be
    [6] => byte4le
    [7] => BASE64
    [8] => UUENCODE
    [9] => HTML-ENTITIES
    [10] => Quoted-Printable
    [11] => 7bit
    [12] => 8bit
    [13] => UCS-4
    [14] => UCS-4BE
    [15] => UCS-4LE
    [16] => UCS-2
    [17] => UCS-2BE
    [18] => UCS-2LE
    [19] => UTF-32
    [20] => UTF-32BE
    [21] => UTF-32LE
    [22] => UTF-16
    [23] => UTF-16BE
    [24] => UTF-16LE
    [25] => UTF-8
    [26] => UTF-7
    [27] => UTF7-IMAP
    [28] => ASCII
    [29] => EUC-JP
    [30] => SJIS
    [31] => eucJP-win
    [32] => EUC-JP-2004
    [33] => SJIS-win
    [34] => SJIS-Mobile#DOCOMO
    [35] => SJIS-Mobile#KDDI
    [36] => SJIS-Mobile#SOFTBANK
    [37] => SJIS-mac
    [38] => SJIS-2004
    [39] => UTF-8-Mobile#DOCOMO
    [40] => UTF-8-Mobile#KDDI-A
    [41] => UTF-8-Mobile#KDDI-B
    [42] => UTF-8-Mobile#SOFTBANK
    [43] => CP932
    [44] => CP51932
    [45] => JIS
    [46] => ISO-2022-JP
    [47] => ISO-2022-JP-MS
    [48] => GB18030
    [49] => Windows-1252
    [50] => Windows-1254
    [51] => ISO-8859-1
    [52] => ISO-8859-2
    [53] => ISO-8859-3
    [54] => ISO-8859-4
    [55] => ISO-8859-5
    [56] => ISO-8859-6
    [57] => ISO-8859-7
    [58] => ISO-8859-8
    [59] => ISO-8859-9
    [60] => ISO-8859-10
    [61] => ISO-8859-13
    [62] => ISO-8859-14
    [63] => ISO-8859-15
    [64] => ISO-8859-16
    [65] => EUC-CN
    [66] => CP936
    [67] => HZ
    [68] => EUC-TW
    [69] => BIG-5
    [70] => CP950
    [71] => EUC-KR
    [72] => UHC
    [73] => ISO-2022-KR
    [74] => Windows-1251
    [75] => CP866
    [76] => KOI8-R
    [77] => KOI8-U
    [78] => ArmSCII-8
    [79] => CP850
    [80] => JIS-ms
    [81] => ISO-2022-JP-2004
    [82] => ISO-2022-JP-MOBILE#KDDI
    [83] => CP50220
    [84] => CP50220raw
    [85] => CP50221
    [86] => CP50222
)

I was in wrong, thinking that mb_detect_order was just an ordered version of this list. The mb_detect_order is just.... useless. In order to encode in UTF8 in the right way use the following code:

$my_encoding_list = [
    "UTF-8",
    "UTF-7",
    "UTF-16",
    "UTF-32",
    "ISO-8859-16",
    "ISO-8859-15",
    "ISO-8859-10",
    "ISO-8859-1",
    "Windows-1254",
    "Windows-1252",
    "Windows-1251",
    "ASCII",
    //add yours preferred
];

//remove unsupported encodings
$encoding_list = array_intersect($my_encoding_list, mb_list_encodings());

//detect 'finally' the encoding
$this->encoding = mb_detect_encoding($source,$encoding_list,true);

This worked and solved my issue with bad data saved in the database.

Max Cuttins
  • 614
  • 2
  • 6
  • 20
1

You can filter these unknown characters by using the UTF-8//IGNORE charset in your iconv method.

$this->encoding = mb_detect_encoding($source,mb_detect_order(),true);

if ($this->encoding!="" && $this->encoding!="UTF8") {
    $source = iconv($this->encoding, "UTF-8//IGNORE", $source);
} 

By using the //IGNORE after your charset, every characters that cannot be represented in the target charset will be silently discarded.

fxbt
  • 2,556
  • 19
  • 19