I am working on localhost, Windows 10, Apache 2.4: Apache/2.4.51 (Win64) OpenSSL/1.1.1l PHP/8.0.11, with database client version libmysql - mysqlnd 8.0.11, which uses server version 10.4.21-MariaDB (mariadb.org binary distribution). The server charset is by default utf8mb4 (UTF-8 Unicode).
I made a PHP script that gets content (including HTML tags) from a Wikipedia page using loadHTMLFile. I then use xpath->query to filter the DOM, and the data is saved in a MySQL table as a string after being escaped with mysqli_real_escape_string. Later on, I query the database and save the content in a variable, which is passed to loadHTML; I then remove a few DOM elements, pass the modified content to saveHTML, and echo it to my webpage.
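The pipeline can be sketched roughly like this (a minimal sketch; the table and column names are placeholders, not the real ones from my code, and a temp file stands in for the live Wikipedia URL):

```php
<?php
// Minimal sketch of the pipeline above (table/column names are placeholders).
// A temp file stands in for the remote Wikipedia page here.
libxml_use_internal_errors(true);

$tmp = tempnam(sys_get_temp_dir(), 'wiki');
file_put_contents($tmp, '<html><body><div id="mw-content-text"><p>Euro article</p></div></body></html>');

$doc = new DOMDocument();
$doc->loadHTMLFile($tmp);                                 // fetch + parse
$xpath = new DOMXPath($doc);
$node  = $xpath->query('//div[@id="mw-content-text"]')->item(0);
$content = $doc->saveHTML($node);                         // back to an HTML string

// Then, with a mysqli connection $conn:
// $escaped = mysqli_real_escape_string($conn, $content);
// mysqli_query($conn, "INSERT INTO pages (content) VALUES ('$escaped')");
echo $content;
```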
What happens is that some characters are being displayed like this:
--> Â
- --> –
€ --> €
ευρώ --> ευÏÏŽÂ
All the characters are displayed correctly when I use echo utf8_decode($output). Note that, instead of using utf8_decode, none of the following has any effect:
<meta charset="utf-8"> // in my html file
header('Content-Type: text/html; charset=utf-8'); // before the echo statement
mysqli_query($conn, "SET NAMES utf8"); // before the INSERT INTO and SELECT statements
mysqli_set_charset($conn, "utf8"); // before the INSERT INTO and SELECT statements
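For reference, this is roughly where the charset calls sit (connection credentials and database name are placeholders; note that in MySQL/MariaDB, "utf8" is actually the 3-byte utf8mb3, so for a utf8mb4 server the usual recommendation is to request utf8mb4 explicitly):

```php
<?php
// Where the charset calls go (credentials/db name are placeholders).
// MySQL/MariaDB's "utf8" is the 3-byte utf8mb3; for a utf8mb4 server the
// usual recommendation is to ask for utf8mb4 explicitly.
$conn = mysqli_connect('localhost', 'user', 'pass', 'crimewiki');
mysqli_set_charset($conn, 'utf8mb4'); // immediately after connecting, before any query
```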
Also, both mb_detect_encoding($output) and mb_detect_encoding(utf8_decode($output)) return UTF-8, not utf8mb4. In my Chrome browser's Network/Headers tab, I always get the Content-Type as text/html; charset=UTF-8, regardless of whatever changes I make in my server-side PHP/MySQL settings.
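(One thing worth noting about that check: PHP's mbstring has no encoding called utf8mb4 at all; that name only exists on the MySQL/MariaDB side, so mb_detect_encoding could never return it anyway.)

```php
<?php
// mbstring has no "utf8mb4" encoding name; that label exists only in
// MySQL/MariaDB, so mb_detect_encoding can only ever report UTF-8 here.
$output = "ευρώ €";
var_dump(mb_detect_encoding($output));                    // "UTF-8"
var_dump(in_array('utf8mb4', mb_list_encodings(), true)); // false
```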
My guess is that the data in the Wikipedia page is in normal UTF-8 form, which is automatically converted by PHP into utf8mb4 when it's downloaded by loadHTMLFile. This data is then saved in the MySQL tables in utf8mb4 format, stays in utf8mb4 format when retrieved later on, and is sent to the browser in utf8mb4 format. When I use utf8_decode, it must be converting it back to normal UTF-8 format.
The problem with my guess is that the PHP docs page for utf8_decode mentions nothing about utf8mb4; rather, it says that multi-byte UTF-8 is converted into single-byte ISO-8859-1. Secondly, the docs say that the ISO-8859-1 charset does not contain the euro sign, yet my webpage successfully shows the euro sign after utf8_decode. A browser is capable of parsing multi-byte UTF-8 characters as well, so if that were the only thing utf8_decode did, it should not make any difference with my code.
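One way to reconcile this (my own reading, not something the docs state): if a UTF-8 string's bytes get reinterpreted as ISO-8859-1 and encoded to UTF-8 a second time, you get exactly this kind of mojibake, and utf8_decode then strips that one extra layer, restoring the original UTF-8 bytes rather than producing Latin-1:

```php
<?php
// "Double UTF-8" demonstration: re-encode the raw UTF-8 bytes as if they were
// ISO-8859-1 (one encoding layer too many), then peel the layer off again.
$original = "ευρώ €";                                              // plain UTF-8
$double   = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1'); // mojibake
var_dump($double === $original);              // false: now garbled
var_dump(utf8_decode($double) === $original); // true: original bytes restored
```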
Edit:
I found the culprit. The following echoes the correct characters:
$stmt = $conn->prepare("Select ...");
...
$result = $stmt->execute();
...
$row = $stmt->get_result()->fetch_assoc();
echo $row['content']; // gives €ερυώ
Now, $row['content'] is the data directly from my database, without any utf8_decode. But I happen to use PHP DOMDocument afterwards, and the following happens:
libxml_use_internal_errors(true); // important
$content = new DOMDocument();
$content->loadHTML($row['content']);
echo $row['content'], $content->saveHTML($content); die();
// The output is: €ερυώ
//â¬ÎµÏÏÏ
The output from the above code, in View Source, is:
€ερυώ<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>â¬ÎµÏÏÏ</p></body></html>
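The only workaround I have seen mentioned for this (a sketch, not an explanation I've found in the docs) is that when the string passed to loadHTML carries no charset declaration, libxml falls back to ISO-8859-1, so every byte of a multi-byte UTF-8 character becomes its own character; hinting the encoding before parsing seems to avoid the mangling:

```php
<?php
// Sketch: loadHTML with no charset declaration vs. with an encoding hint.
libxml_use_internal_errors(true);

$html = "<p>ευρώ €</p>"; // UTF-8 bytes, no charset declaration anywhere

$dom = new DOMDocument();
$dom->loadHTML($html); // libxml falls back to Latin-1 here
$garbled = $dom->saveHTML();

$dom2 = new DOMDocument();
$dom2->loadHTML('<?xml encoding="utf-8"?>' . $html); // hint the real encoding
$clean = $dom2->saveHTML();

var_dump(strpos($garbled, 'ευρώ') !== false); // false: bytes were reinterpreted
var_dump(strpos($clean, 'ευρώ') !== false);   // true: round-trips intact
```

If that is right, it would also explain why utf8_decode appeared to "fix" my output: saveHTML was handing back doubly encoded UTF-8.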
So please explain: what on earth are loadHTML and saveHTML doing here?
P.S.: My whole code is available in a GitHub repo: https://github.com/AnupamKhosla/crimeWiki, and the specific scripts dealing with Wikipedia page encoding are at https://github.com/AnupamKhosla/crimeWiki/blob/main/include/wikipedea_code.php and https://github.com/AnupamKhosla/crimeWiki/blob/main/include/post_code.php