0

I'm importing emails from a Gmail account and storing them in a database. Apparently my PHP logic to convert everything to UTF-8 has failed with one particular email and when I run the query with that email I receive the following MariaDB error:

Incorrect string value: '\x96 conv...'

Here is the minimized version of the SQL query:

INSERT INTO mail (subject) VALUES ('orkut – convite...');

I eventually tracked this down to a header subject string. In my general MariaDB query log the string appears as intended with the en dash (): orkut – conv.... When PHP attempts to run the query I get the error Incorrect string value as mentioned above however when I manually run the exact same query via HeidiSQL the query executes just fine!

The header is split between the key and value and then the value is processed as follows:

$p = explode(': ', $header, 2);
$s = mb_convert_encoding(trim($p[1]), 'UTF-8', mb_detect_encoding($p[1]))

Now I've tried many things and the closest I could get was cmbuckly's answer on another thread however using $s = iconv('utf-8', 'windows-1252', $s); did not resolve the problem in this circumstance. I have heaps of other attempts like using utf8_encode($s) afterwards though I'm just at a loss at this point.

  • My MariaDB database has the proper utf8mb4 character set and utf8mb4_unicode_520_ci database collation.
  • I am not going to try to use MariaDB to encode/re-encode text, this is explicitly intended as a PHP question.

What is causing MariaDB to not work with the query via PHP though allows the query to work via HeidiSQL? Have I properly decoded the string?

John
  • 1
  • 13
  • 98
  • 177

1 Answers1

1

https://www.fileformat.info/info/unicode/char/2013/charset_support.htm

The only time an en-dash has a 0x96 byte in it is when a windows cp125x encoding is used.

Figure out which one your source data uses and convert it to UTF8 with

$utf8_data = mb_convert_encoding($data, 'UTF-8', 'cp125x');

Additionally: UTF-8 all the way through

Sammitch
  • 30,782
  • 7
  • 50
  • 77
  • This directly answers the question perfectly - thank you. The overall issue with no doubt likely extends to numerous other characters that when in a string causes PHP to fail to determine the proper character encoding so after much work I've put together a thread to help identify more troublesome characters so we can avoid data corruption: https://stackoverflow.com/questions/70045981/list-of-known-troublesome-characters-that-causes-php-to-fail-to-detect-the-prope. – John Nov 20 '21 at 12:42