0

I am working on localhost windows10 apache 2.4: Apache/2.4.51 (Win64) OpenSSL/1.1.1l PHP/8.0.11and Database client version: libmysql - mysqlnd 8.0.11 which uses the server Server version: 10.4.21-MariaDB - mariadb.org binary distribution. It is by default set to _utf8mb4: Server charset: UTF-8 Unicode (utf8mb4).

I made a php script that gets content(including html tags) from a Wikipedia page using loadHTMLFile. I then further use xpath->query to filter the dom and then the data is saved in mysql table as a string after being escaped by mysqli_real_escape_string. Later on, I query the database and save the content in a variable which is passed to loadHTML, then I remove a few dom elements and then pass the modified content to saveHTML and echo it to my webpage.

What happens is some characters are being displayed like:

  --> Â
- --> –
€ --> €
ευρώ --> ευÏÏŽÂ

All the characters are displayed correctly, when I use echo utf8_decode($output). Note: that instead of using utf8_decode, any of the following has no effect:

<meta charset="utf-8">  // in my html file  
header('Content-Type: text/html; charset=utf-8'); // before the echo statement      
mysqli_query($conn, "SET NAMES utf8"); // before mysql insert into and Select from statements 
mysqli_set_charset($conn, "utf8"); // before mysql insert into and Select from 

statements

Also both mb_detect_encoding($output) and mb_detect_encoding(utf8_decode($output)) returns UTF-8 not utf8mb4. In my chrome browser's network/headers tab, I always get Content-type as text/html; charset=UTF-8 , regardless of whatever changes I make in my server side php/mysql settings.

My guess is that, the data in the Wikipedia page is in normal UTF-8 form, which is automatically converted by php into utf8mb4 when it's downloaded by loadHTMLFile. Now this data is saved in mysql tables in utf8mb4 format. This data when retrieved later on stays in utf8mb4 format and is seen to the browser in utf8mb4 format. When I use utf8_decode it must convert it to normal utf-8 format.

The problem with my guess is that the php docs about utf8_decode page, mention nothing of utf8mb4, rather it says, multi-byte UTF-8 ISO-8859-1 encoding is converted into single byte UTF-8 ISO-8859-1. Secondly the docs say, ISO-8859-1 charset does not contain the EURO sign. But my webpage successfully shows euro sign after utf8_decode and a browser is capable of parsing multibyte utf-8 characters as well, so if that was the only thing that utf8_decode does, then it should not make any difference with my code.

Edit:

I found the culprit. The following echos correct characters:

$stmt = $conn->prepare("Select ...");
...
$result = $stmt->execute();
...
$row = $stmt->get_result()->fetch_assoc()

echo $row['content']; // gives €ερυώ  

Now, $row['content'] is the data directly from my database without any utf_decode. But I happen to use php domdocument afterwards and the following happens:

libxml_use_internal_errors(true); // important
$content = new DOMDocument();
$content->loadHTML($row['content']);
echo $row['content'], $content->saveHTML($content); die();
// The output is:  €ερυώ
//â¬ÎµÏÏÏ

The output from the above code in my view source is:

€ερυώ<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>â¬ÎµÏÏÏ</p></body></html>

So please explain what the heck does loadHTML and saveHTML is doing here?


P.S: My whole code available on github repo: https://github.com/AnupamKhosla/crimeWiki and the speciic scripts about wikipedea pages encoding at https://github.com/AnupamKhosla/crimeWiki/blob/main/include/wikipedea_code.php https://github.com/AnupamKhosla/crimeWiki/blob/main/include/post_code.php

user31782
  • 7,087
  • 14
  • 68
  • 143
  • 1
    https://stackoverflow.com/questions/279170/utf-8-all-the-way-through – Sammitch Dec 21 '21 at 21:52
  • What version of MySQL? (`SELECT @@version;`) – Rick James Dec 23 '21 at 07:11
  • "ISO-8859-1" is probably closest to MySQL's "latin1". – Rick James Dec 23 '21 at 07:15
  • Now, that you posted the edit in your question, it is clear that the problem is not with mysqli. The problem is that you didn't specify the correct encoding for your DOMDocument. See https://stackoverflow.com/a/47396055/1839439 – Dharman Dec 23 '21 at 16:29
  • @Dharman Interestingly, opposite to the link you gave, When I load a wikipedea url with `$tmp = new DOMDocument(); $tmp->loadHTMLFile($url)` all the characters are encoded in utf-8 and the data saved in my database is stillin utf-8 format. It is when I query that data and create html dom back from it with `$content = new DOMDocument(); $content->loadHTML($row['content']); `, I get encoding problems. – user31782 Dec 23 '21 at 16:40

2 Answers2

3

The fact that utf8_decode() helps you is incidental. This function should not be used most of the time. If using it helps you, then it can only mean that somehow you have managed to mangle your data.

utf8mb4 is MySQL's charset that represents the full UTF-8 encoding. Therefore, if you are using UTF-8 everywhere in your code, you should never need to use utf8_decode() as it will only damage the data. ISO-8859-1 supports very few characters. It's not what you want.

What seems to have happened here is that you forgot to set $conn->set_charset('utf8mb4') when you opened the connection. Many MySQL servers default to Latin1 when you don't specify the charset, which means that even though your schema might be using utf8mb4 consistently, the connection to the database doesn't and converts the data into garbled up text.

The solution is simple, always set the right connection charset right after opening a new mysqli connection. $conn->set_charset('utf8mb4') will solve your problem and you don't need to use the ridiculous utf8_decode() function that accidentally solved your problem.

Dharman
  • 30,962
  • 25
  • 85
  • 135
  • Additionally, the € symbol was surviving the process because MySQL also lies when it claims Latin-1. In fact it's using Windows-1252. ([Reference](https://dev.mysql.com/doc/refman/8.0/en/charset-we-sets.html)). – Álvaro González Dec 22 '21 at 08:30
  • Unfortunately `$conn->set_charset('utf8mb4')` did not change anything, as I mentioned in the question. My phpmyadmin page mentions `Server connection collation: utf8mb4_unicode_ci` and `Server charset: UTF-8 Unicode (utf8mb4)`. Also, the php ini file mentions `UTF-8` as well. Could it be that `loadHTML` and `xpath->query` are garbling the character encoding somewhere or maybe wikipedea source is not utf-8? – user31782 Dec 22 '21 at 11:30
  • No, this has nothing to do with phpMyAdmin. Don't look there. Just delete all your stored data, set the charset, and double check that your columns have the right charset too. Make sure you set the connection charset when you insert the data as well when you select – Dharman Dec 22 '21 at 11:54
  • Hello again, I did delete everything and I use `set_charset('utf8mb4')` in the function which returns me the `$conn` itself. More importantly, I found where the data is mangled. Please see the edit section in my question. – user31782 Dec 23 '21 at 16:15
1

Using any encode/decode is a symptom of misconfiguration.

When you connect to mysql, you tell it what encoding is being used in the client.

When you declare the tables, you specify how to store things. CHARACTER SET utf8mb4 is often the best.

Please provide SELECT HEX(col), col ... for a sample. (You cannot trust what the browser displays; it tries to "fix" the encoding. Once you have the hex, we can discuss how to repair the data. A common problem is "double-encoding", wherein the data has been misconverted twice.

As for your current samples, there are enough inconsistencies that I cannot deduce what went wrong:

&nbsp; ->  That is represented as hex 80 by some word processors, not by HTML.
-  --> this is a plain dash; it is never mangled.  Perhaps you have an n-dash or m-dash?
€  --> mangles to "€" via "Mojibake" through latin1;
       did you omit the "SINGLE LOW-9 QUOTATION MARK" that looks like a comma??
ευρώ  --> ευÏÏŽ via "Mojibake" through latin1;

More on Mojibake and other common manglings: Trouble with UTF-8 characters; what I see is not what I stored

Rick James
  • 135,179
  • 13
  • 127
  • 222
  • I did a sample `SELECT HEX(col), col ...` and it returned everything in utf8 encoding which was shown correctly in the browser. I got `2D2D2D2D E282ACCEB5CF81CF85CF8E` for `---- €ερυώ`. I think the culprit is `xpath` or `DOMDocument`. I will get back to you with hex values on my main page where I see the actual bug. – user31782 Dec 23 '21 at 15:46
  • The problem with the main page is that the `$result` of `mysqli::query("SELECT HEX(col)")` is being fed into `htmlspecialchars`, `loadHTML` and `saveHTML`. So feeding them with hex codes give me error. Let me find a way out. – user31782 Dec 23 '21 at 15:52
  • Actualy `SELECT ... FROM ..` your table. This is to see what is in the table. – Rick James Dec 23 '21 at 16:06
  • I did perform it as `SELECT ... FROM ...` please see my **edit** in the question, I found the bug point which mangles the data. – user31782 Dec 23 '21 at 16:12
  • `–` is Mojibake for "En dash". – Rick James Dec 23 '21 at 22:49