Using PHP, I parse a text file that contains Unicode characters like 😍
If I just read the file in, without any further encoding or decoding, the smiley is parsed and then passed through json_encode, and the output is \u00f0\u009f\u0098\u008d
A JavaScript file fetches the .json data and outputs the 4 escaped characters as ð
Looking at a Unicode table, the symbol is called "SMILING FACE WITH HEART-SHAPED EYES" and has the code point U+1F60D (decimal 128525).
Is there a way to convert the 4 code units to the Unicode code point, or ideally to a proper HTML-encoded form, in this case &#128525;?
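For reference, going from a correctly encoded UTF-8 string to numeric HTML entities can be sketched as below. This is only a minimal sketch: `toNumericEntities` is a hypothetical helper name, it assumes the mbstring extension and PHP 7.4 or later, and it assumes the input really is valid UTF-8 (which, as it turned out, was not the case for the mangled data).

```php
<?php
// Hypothetical helper: replace every non-ASCII character in a UTF-8 string
// with its decimal numeric HTML entity.
function toNumericEntities(string $utf8): string
{
    return preg_replace_callback(
        '/[^\x00-\x7F]/u',  // with the u flag, one match is one whole code point
        static fn(array $m): string => '&#' . mb_ord($m[0], 'UTF-8') . ';',
        $utf8
    );
}

echo toNumericEntities("I \u{1F60D} PHP"), PHP_EOL; // I &#128525; PHP
```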
Looking at conversions, the UTF-8 code units look similar (F0 9F 98 8D 0A 0A), but I can't reproduce the 4 escaped units I get, so I don't even know what I'm looking at.
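One plausible reading of the 4 escaped units: each UTF-8 byte (F0 9F 98 8D) was treated as a separate ISO-8859-1 character and then re-encoded as UTF-8 before json_encode ran. The sketch below reproduces exactly that output under this assumption; the variable names are made up for illustration and it needs the mbstring extension.

```php
<?php
$smiley = "\xF0\x9F\x98\x8D"; // the UTF-8 bytes of U+1F60D

// Intact UTF-8: json_encode sees one code point and emits a surrogate pair.
$good = json_encode($smiley);

// Assumed failure mode: the bytes are misread as ISO-8859-1, so each byte
// becomes its own character (U+00F0, U+009F, U+0098, U+008D).
$mangled = mb_convert_encoding($smiley, 'UTF-8', 'ISO-8859-1');
$bad = json_encode($mangled);

// If that is what happened, converting back restores the original bytes.
$repaired = mb_convert_encoding($mangled, 'ISO-8859-1', 'UTF-8');

echo $good, PHP_EOL;                 // "\ud83d\ude0d"
echo $bad, PHP_EOL;                  // "\u00f0\u009f\u0098\u008d"
var_dump($repaired === $smiley);     // bool(true)
```

Repairing after the fact only works if every character in the string was mangled the same way, so fixing the step that misreads the encoding is the safer route.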
Update: I made a mistake and edited the second paragraph: \u00f0\u009f\u0098\u008d is already the result of json_encode().
Here is the basic function that reads the data from the file. Looking at the source, the smiley is "hardcoded", so you actually see it:
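function readLocalFile() {
    $file_html = fopen('output.html', "r");
    if ($file_html === false) {
        return; // file could not be opened; feof() would otherwise loop forever
    }
    $html = "";
    // read the file line by line; no encoding conversion happens here
    while (!feof($file_html)) {
        $html .= fgets($file_html);
    }
    fclose($file_html);
    // here I use regex to filter for specific tags, the result is an array
    $cleanData = parseData($html);
    saveToFile(json_encode($cleanData));
}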
I just created a dummy.html with just 😍 as the content, and this returns the correct result, \ud83d\ude0d. In the context of the whole data it is still mangled as described above, which is weird.
I have to look at the way the data is saved to output.html; that's where the problem has to be. I've been looking at the wrong part of the problem the whole time, d'oh!
Last update: I finally found the error. It was in the parseData function: loadHTML somehow garbled the content. I found the solution here: PHP DOMDocument loadHTML not encoding UTF-8 correctly
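The linked question is not reproduced here. One commonly cited workaround for this loadHTML behaviour is to prefix the markup with an XML encoding declaration so the parser does not fall back to ISO-8859-1. The sketch below assumes that approach and the dom extension; `firstParagraphText` is a hypothetical stand-in, not the original parseData function.

```php
<?php
// Hypothetical helper: return the text of the first <p> element,
// telling DOMDocument explicitly that the input is UTF-8.
function firstParagraphText(string $html): string
{
    $doc = new DOMDocument();

    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);

    // Without this prefix, loadHTML() may treat the bytes as ISO-8859-1.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $p = $doc->getElementsByTagName('p')->item(0);
    return $p === null ? '' : $p->textContent;
}

echo json_encode(firstParagraphText("<p>\u{1F60D}</p>")), PHP_EOL;
```

With the prefix in place, json_encode should emit the surrogate pair again rather than the four \u00xx units.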