1

I have encountered a scenario where an email from someone in Europe keeps failing to execute. After minimizing the query I've determined that after all special characters like å and é are removed the query works fine in PHP / mysqli_query. The queries also don't work in MariaDB's command line though they do work in HeidiSQL, I imagine whatever HeidiSQL uses it internally adjusts strings used in the Query tabs.

Let's get the following out of the way:

  • Database Character Set: utf8mb4.
  • Database Collation: utf8mb4_unicode_520_ci.
  • Database column collation: utf8mb4_unicode_520_ci.
  • The correct query for the request method SET CHARACTER SET 'utf8mb4' is being correctly executed.

Here is the query:

 INSERT INTO example_table (example_column) VALUES ('Håko');

I should note that I tried the following (which also failed) even though I firmly believe that this issue occurs from and should be resolved via PHP:

 INSERT INTO example_table (example_column) VALUES (CONVERT('Håko' USING utf8));

Here is the MariaDB error:

Incorrect string value: '\xE9rard ...'

Like I said this string is originating from an email message so I'm pretty sure that the issue is with PHP, not MariaDB. So let's go backwards to that code that seems to otherwise work. Please keep in mind that this has taken at least two days to put together in the correct order to even get the strings to appear correctly in the MariaDB query log without being incorrectly converted to UTF-8 and corrupting the special Latin characters:

<?php
$s1 = '=?iso-8859-1?Q?=22G=E9rd_Tabt=22?= <berbs@example.com>';//"Gérd Tabt" <berbs@example.com>

if (strlen($s1) > 0)
{
 if (substr_count($s1, '=?') && substr_count($s1, '?= '))
 {
  $p = explode('?= ', $s1);

  $p[0] = $p[0].'?=';
  $s2 = imap_mime_header_decode($p[0])[0]->text.' '.$p[1];
 }
 else {$s2 = imap_mime_header_decode($s1)[0]->text;}

 if (strpos($s1, '=96') !== false) {$s2 = mb_convert_encoding($s2, 'UTF-8', 'CP1252');}
 else if (mb_convert_encoding($s2, 'UTF-8') == substr_count($s1, '?')) {$s2 = mb_convert_encoding($s2, 'UTF-8');}
}
else {$s2 = $s1;}
?>

There isn't any other relevant code handling this header string.

What is causing what I presume to be UTF-8 encoded strings to break PHP's mysqli_query and the MariaDB command line from working with this query?

John
  • 1
  • 13
  • 98
  • 177
  • Good code indentation would help us read the code and more importantly it will help **you debug your code** [Take a quick look at a coding standard](https://www.php-fig.org/psr/psr-12/) for your own benefit. You may be asked to amend this code in a few weeks/months and you will thank me in the end. – RiggsFolly Nov 22 '21 at 16:45
  • @RiggsFolly I respect your right to have an opinion that differs from my own. – John Nov 22 '21 at 16:47

2 Answers2

0

Where did the hex E9 come from? That is encoded latin1. Yet your configuration seems to claim that your client is encoded utf8mb4. You must have the connection charset match what the encoding is in the client. The database and table and client can have a different encoding; MariaDB is happy to convert on the fly when INSERTing or SELECTing.

For more analysis, see Trouble with UTF-8 characters; what I see is not what I stored

Rick James
  • 135,179
  • 13
  • 127
  • 222
0
  1. if (mb_convert_encoding($s2, 'UTF-8') == substr_count($s1, '?'))
    

    This makes no sense: comparing a string (converted from anything to UTF-8) against an integer (amount of matches) will only ever be equal when the converted text is '0', which is also the amount of finding '?' in it, and due to the type unsafe comparison parameter == this is the only scenario where '0' equals 0.

    So your text is never converted to UTF-8 and remains whatever it was (in this case ISO-8859-1).

  2. mb_convert_encoding($s2, 'UTF-8')
    

    Sure you want to convert to UTF-8 without telling the source encoding? ISO-8859-1 as per email header isn't the only one to expect - why not extracting that information and passing it to the function?

MariaDB is right: you're handing over ISO-8859-1 encoded text in that case, while the DBMS expects the UTF-8 encoding.

AmigoJack
  • 5,234
  • 1
  • 15
  • 31