0

I'm having a problem with the webpage inconsistently not showing characters correctly for user-generated posts, such as showing ’ instead of ', for example. However, this doesn't happen all the time, some posts from users show the correct apostrophe, but others do not:

I’m having trouble and can’t figure it out

My server is running on PHP 8.1 and MySQL 8.0

Here are some of my config setups:

php.ini:
default_mimetype = "text/html"
default_charset = "UTF-8"

MySQL Variables:

character set clientDocumentation   utf8mb4
character set connectionDocumentation   utf8mb4
character set databaseDocumentation utf8mb4
character set resultsDocumentation  utf8mb4
character set serverDocumentation   utf8mb4
character set systemDocumentation   utf8mb3
character sets dirDocumentation /usr/share/mysql/charsets/
collation connectionDocumentation   utf8mb4_0900_ai_ci
collation databaseDocumentation utf8mb4_0900_ai_ci
collation serverDocumentation   utf8mb4_0900_ai_ci
default collation for utf8mb4Documentation  utf8mb4_0900_ai_ci

All databases, tables and fields have a collation set to: utf8mb4_0900_ai_ci

My HTML has the following header on all pages:

<meta http-equiv="content-type" content="text/html;charset=utf-8" />

My PHP has the following header on all pages:

header("Content-Type: text/html; charset=utf-8");

Again, these are user-generated posts. I'm not sure why it's happening, most of the users are based in the US, so I don't think it's due to international keyboards. I am hearing that it could be related to Windows or Mac devices, or users copying/pasting text from Word into the form field? Is there supposed to be some kind of PHP conversion required before saving every post to the database? Or is there something I'm supposed to be doing with PHP to display the text correctly? What is the standard practice here?

UPDATE: Added an example of PHP/MySQL code for querying and printing text to the webpage:

$conn = mysqli_connect($hostname, $username, $password, $database) or die("Database Connection Error");
$result = mysqli_query($conn, "SELECT field FROM table WHERE id = 123") or die(mysqli_error($conn));
$row = mysqli_fetch_assoc($result);
$message = htmlspecialchars($row['field'], ENT_QUOTES);
echo $message;

UPDATE: Adding more information found:

mysqli_set_charset($conn,"utf8mb4");
var_dump(mysqli_get_charset($conn));

object(stdClass)#2 (8) { 
  ["charset"]=> string(7) "utf8mb4" 
  ["collation"]=> string(18) "utf8mb4_general_ci" 
  ["dir"]=> string(0) "" 
  ["min_length"]=> int(1) 
  ["max_length"]=> int(4) ["number"]=> int(45) 
  ["state"]=> int(1) ["comment"]=> string(13) "UTF-8 Unicode" 
}

It seems PHP is using "utf8mb4_general_ci" as the collation, which is different than the default global variables set in the MySQL config. I tried setting the collation using the following below, but that did not change this:

mysqli_query($conn, 'SET collation_connection = utf8mb4_0900_ai_ci');
mysqli_query($conn, 'SET SESSION collation_connection = utf8mb4_0900_ai_ci');

Also, I discovered another MySQL table: performance_schema.session_variables , with the following:

character_set_client = utf8mb4
character_set_connection = utf8mb4
character_set_database = utf8mb3 // different than globals variable
character_set_filesystem = binary
character_set_results = utf8mb4
character_set_server = utf8mb4
character_set_system = utf8mb3
collation_connection = utf8mb4_unicode_ci // different than globals variable
collation_database = utf8mb3_general_ci  // different than globals variable
collation_server = utf8mb4_0900_ai_ci

I think this might be related to the problem. I tried to update these variables, but even the admin user is getting "permission denied". I cannot update these, nor do I know why these are being set differently than the global variables. I even rebooted MySQL and there is no change.

peppy
  • 173
  • 2
  • 17
  • I'm going to guess that they are converted smart (curly) quotes, especially if it's being copied from something else. – aynber May 09 '23 at 18:47
  • That's either UTF8 being displayed as cp1252, or UTF8 that's had a "1252 to UTF8" conversion run on it twice. As-written there is not enough information to actually diagnose the issue or offer a solution. You should read through the marked duplicate which can help you track down where this issue is being introduced. – Sammitch May 09 '23 at 18:58
  • @Sammitch - Already read through those guides, made the changes (as described above) and this did not solve the issue. The "marked duplicate" guide mentions nothing about "cp1252" or how to deal with it. – peppy May 09 '23 at 19:09
  • The incorrect quotes are MS smart quotes. They are not part of the early 1252 charset. Use UTF8 everywhere. Add the PHP code you are using to retrieve the data, and the connection string to DB you use. – user3783243 May 09 '23 at 19:28
  • Without seeing how the data gets into and out of the database, as well as what is already in there, we still cannot do more than speculate as to what the problem is. You settings may be correct _now_, but that would not necessarily fix data that was mishandled before the change, nor do correct settings necessarily prevent mishandling the data some other way. – Sammitch May 09 '23 at 20:18
  • 1
    See "Mojibake" in https://stackoverflow.com/questions/38363566/trouble-with-utf8-characters-what-i-see-is-not-what-i-stored – Rick James May 10 '23 at 22:09
  • I believe what I am seeing is "Mojibake" as Rick James stated. I don't know why everything suddenly broke, I didn't make any change to the databases or website. Not sure what the easiest way of fixing it is either. What I've done is search for mojibake (like ð characters) in my database fields, look up each corresponding emoji online and then run: "UPDATE table SET field = REPLACE(field, '😀', ''); " - like hundreds of individual queries... is there a better way of fixing this??? – peppy May 12 '23 at 21:08
  • Still need you to post the code. – user3783243 May 12 '23 at 23:51
  • @user3783243 I've added an example above. – peppy May 13 '23 at 00:15
  • @peppy Don't see your mysqli connection being set to UTF8 https://stackoverflow.com/questions/10829816/set-character-set-using-mysqli – user3783243 May 13 '23 at 06:38
  • I have the connection collation set in MySQL, the default configuration: "collation connection: utf8mb4_0900_ai_ci" . From what I can see, PHP is using the default utf8mb4 charset as no other is specified. – peppy May 13 '23 at 19:29
  • @user3783243 I updated the connection with mysqli_set_charset($conn,"utf8mb4"); and checked the charset before/after this statement, and it was the same "utf8mb4". However, after looking deeper into "mysqli_get_charset", I am seeing that the collation is "utf8mb4_general_ci". It is supposed to be "utf8mb4_0900_ai_ci". Do you know how to fix this? It doesn't seem like it's picking up the default value set in the MySQL config. I tried "mysqli_query($conn, 'SET SESSION collation_connection = utf8mb4_0900_ai_ci');", but that makes no difference. – peppy May 13 '23 at 20:13
  • Added more code and other MySQL session variable information above. – peppy May 13 '23 at 21:42
  • Nevermind my comment about "mysqli_get_charset" showing the wrong collation, just found out the info coming from that function is bugged/irrelevant/inaccurate: https://stackoverflow.com/questions/76250010/mysqli-get-charset-showing-wrong-collation-how-to-update/76250085#76250085 – peppy May 14 '23 at 23:44
  • The `COLLATION` is irrelevant to the original question, only the `CHARACTER SET`, which needs to be `utf8mb4` (or utf8mb3`). – Rick James May 17 '23 at 17:57
  • Check `SELECT HEX(col)...` to see if `’` is `E28099` (simple Mojibake) or `C3A2E282ACE284A2` ("double encoding"). – Rick James May 17 '23 at 17:59
  • @RickJames It looks like this is double encoding. When I run `SELECT HEX(col)` on `≤`, I get: `C3A2E280B0C2A4` – peppy May 17 '23 at 21:27
  • This discusses that problem: https://stackoverflow.com/questions/38363566/trouble-with-utf8-characters-what-i-see-is-not-what-i-stored and has some clues or links to clues on how to fix it. However, if you have some Mojibake and some double-encoding in the same column, it could get tricky to fix. Since it is difficult to get a Question "reopened", you could start a new Question with more of the gory details and request help in "fixing the data". – Rick James May 17 '23 at 21:33
  • @RickJames Ah, I see. I stumbled across that solution on another post, but it didn't quite work. I added another question about it here: https://stackoverflow.com/questions/76267990/mysql-convert-function-deleting-data-invalid-characters-how-to-prevent-delet . It seems that lots of my data is getting truncated and even completely deleted. I could use the "IFNULL" that you suggested to prevent complete deletion, but do you know a solution to prevent silent truncation if I am running this function on 1M product descriptions? I had to restore my database. – peppy May 17 '23 at 21:40
  • I am also seeing characters like `é` showing as `é`. I found that it's doing this for foreign language characters as well, not just emojis. – peppy May 17 '23 at 21:48
  • é is hex E9 in latin1. é is simple Mojibake. (hex C3A9). Double encoding is hex C383C2A9 – Rick James May 17 '23 at 22:07

0 Answers0