Using PHP, I parse a text file that contains Unicode characters like 😍

The file is read in without any further encoding or decoding. The smiley is parsed, then passed through json_encode, and the output is \u00f0\u009f\u0098\u008d

A JavaScript file loads the .json data and renders the four escaped characters as ð

According to a Unicode table, the symbol is called "SMILING FACE WITH HEART-SHAPED EYES" and has the code point U+1F60D (128525).

Is there a way to convert the four code units to the Unicode code point or, ideally, to a proper HTML-encoded form, in this case &#128525;?

Looking at conversions, the UTF-8 code units look similar (F0 9F 98 8D 0A 0A), but I can't reproduce the four escaped units I get, so I don't even know what I'm looking at.

Update: I made a mistake and edited the second paragraph: \u00f0\u009f\u0098\u008d is already the result of json_encode().

Here is the basic function that reads the data from the file. Looking at the source, the smiley is "hardcoded", so you can actually see it:

function readLocalFile() {
  $file_html = fopen('output.html', "r");
  $html = "";

  while(!feof($file_html)) {
    $html .= fgets($file_html);
  }

  fclose($file_html);

  // here I use regex to filter for specific tags, the result is an array
  $cleanData = parseData($html);

  saveToFile(json_encode($cleanData)); 
}

I just created a dummy.html with just 😍 as the content, and this returns the correct result, \ud83d\ude0d. In the context of the whole data it is still mangled as described above. Weird.

I have to look at the way the data is saved to output.html, that's where the problem has to be. I've been looking at the wrong part of the problem the whole time, d'oh!

Last update: I finally found the error. It was in the parseData function: loadHTML garbled the content. I found the solution here: PHP DOMDocument loadHTML not encoding UTF-8 correctly

John Smith
  • This _IS_ UTF-8, the 0A 0A at the end are the next characters (CR or LF) already. – Ulrich Eckhardt Sep 05 '13 at 09:59
  • Your input string is broken, `\u00f0\u009f\u0098\u008d` is obviously *not* `U+1F60D` because that would be `\ud83d\ude0d`. You should take a look why you have that in the input string. This looks a bit like a misunderstood "conversion" from UTF-8 into something JSON, probably due to broken PHP json unicode handling ([#62010](https://bugs.php.net/bug.php?id=62010)?) You should share some code to say more. – hakre Sep 05 '13 at 10:28
  • Just seeing your edit. Can you provide a sample of [the string hex-encoded](http://stackoverflow.com/q/1057572/367456) before you pass it into `json_encode`? If it's very long or contains private information, maybe just reduce it to the part that makes a problem. – hakre Sep 05 '13 at 10:49

2 Answers

What puzzles me with your question is the \u00f0\u009f\u0098\u008d sequence. It just does not sound like anything standardized.

As you wrote, this is about the Unicode character 'SMILING FACE WITH HEART-SHAPED EYES' (U+1F60D). The \u-based notation you show seems to suggest JavaScript / JSON encoded Unicode characters. So let's review this a little:

  • JSON uses UTF-16 surrogate pairs for anything not in the Basic Multilingual Plane (U+0000 through U+FFFF).
  • U+1F60D is not in the Basic Multilingual Plane.
  • Its UTF-16 encoding is therefore 0xD83D 0xDE0D
  • This is not what you have
  • Its UTF-8 encoding is 0xF0 0x9F 0x98 0x8D
  • This looks like what you've misused.
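
The code units in the list above can be checked directly. This is a minimal Python sketch using only the standard library; the variable names are illustrative and not taken from the question's code.

```python
import json

ch = "\U0001F60D"  # SMILING FACE WITH HEART-SHAPED EYES

# UTF-16 (big-endian): a surrogate pair, i.e. two 16-bit code units.
utf16 = ch.encode("utf-16-be")
units = [int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)]
print([f"0x{u:04X}" for u in units])   # ['0xD83D', '0xDE0D']

# UTF-8: four single-byte code units.
utf8 = ch.encode("utf-8")
print([f"0x{b:02X}" for b in utf8])    # ['0xF0', '0x9F', '0x98', '0x8D']

# A JSON encoder that escapes non-ASCII writes the surrogate pair.
print(json.dumps(ch))                  # "\ud83d\ude0d"
```

The four bytes of the UTF-8 form are exactly the low bytes of the four escapes in the question, which is what suggests the bytes were escaped one by one.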

After this quick analysis, the answer is the following: if you can assume that every \u???? sequence has been misused in the same way to encode UTF-8 byte sequences, then all you need to do is hook onto each of them, take the byte encoded in the last two hex digits (positions 5 and 6, indexes 4 and 5), and put those bytes back together into a UTF-8 string.
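
For illustration only, the byte-recombining idea could look like the following Python sketch. `repair_escapes` is a hypothetical helper, and it rests on the assumption stated above: every run of `\u00XX` escapes holds UTF-8 bytes. It is a stopgap for already-damaged data, not a replacement for fixing the encoding step.

```python
import re

def repair_escapes(text):
    # Hypothetical helper: rebuild UTF-8 text from runs of \u00XX escapes
    # whose last two hex digits are really raw bytes.
    def fix_run(match):
        hex_bytes = re.findall(r"\\u00([0-9a-fA-F]{2})", match.group(0))
        raw = bytes(int(h, 16) for h in hex_bytes)
        try:
            return raw.decode("utf-8")
        except UnicodeDecodeError:
            # Not valid UTF-8, so assume the escapes were genuine; keep them.
            return match.group(0)

    return re.sub(r"(?:\\u00[0-9a-fA-F]{2})+", fix_run, text)

broken = r"\u00f0\u009f\u0098\u008d"   # raw string: literal backslashes
fixed = repair_escapes(broken)
print(fixed == "\U0001F60D")           # True
```

A run that does not decode as UTF-8, such as a lone `\u00e9`, is left untouched, so genuine Latin-1 range escapes survive.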

As this seems broken, I won't give full source code here because I don't want to encourage that practice; you need to fix this where the encoding happens. However, you can find code outlined in an answer to PHP DomDocument failing to handle utf-8 characters (☆).

So fix the input string containing the wrong \u sequences (the u stands for Unicode, which does not hold in your case, because \u escapes imply UTF-16 code units, not binary octets). You need to find out where those wrong \u sequences are introduced; that is not clear from your question.

hakre

What you have is UTF-8 data decoded as ISO-8859-1 (latin1) to Unicode, then JSON encoded. If you:

  1. Decode the JSON to Unicode.
  2. Encode to bytes with latin-1.
  3. Decode to Unicode with UTF-8.

This should give you the correct character. I don't do PHP, but here's a Python proof:

>>> '\u00f0\u009f\u0098\u008d'.encode('latin1').decode('utf8')
'\U0001f60d'
>>> import unicodedata as ud
>>> ud.name('\U0001f60d')
'SMILING FACE WITH HEART-SHAPED EYES'
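
The same three steps can also start from the JSON text itself and end with the HTML numeric character reference the question asks for. This is a minimal sketch; `json_text` stands in for the contents of the saved file, and the variable names are illustrative.

```python
import json

# Stand-in for the saved JSON: a string made of four \u00XX escapes.
json_text = r'"\u00f0\u009f\u0098\u008d"'

mojibake = json.loads(json_text)   # 1. decode the JSON to Unicode
raw = mojibake.encode("latin-1")   # 2. encode to bytes with latin-1
text = raw.decode("utf-8")         # 3. decode to Unicode with UTF-8

# Write each character as a decimal numeric character reference.
html = "".join(f"&#{ord(c)};" for c in text)
print(html)                        # &#128525;
```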

As for how the data got garbled in the first place: the HTML may actually have been UTF-8-encoded but incorrectly declared as ISO-8859-1 or Windows-1252.
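
To see that such a misreading produces exactly the escapes in the question, the garbling can be reproduced in a few lines. This sketch assumes the reader treats the UTF-8 bytes as ISO-8859-1; it does not model what PHP does internally.

```python
import json

original = "\U0001F60D"
utf8_bytes = original.encode("utf-8")    # the bytes actually in the file

# Assumption: the bytes are read as ISO-8859-1, one character per byte.
misread = utf8_bytes.decode("latin-1")

print(json.dumps(misread))    # "\u00f0\u009f\u0098\u008d"
print(json.dumps(original))   # "\ud83d\ude0d"
```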

Mark Tolonen
  • thanks for the explanation! I used loadHTML() and it handled the html-document I provided as ISO-8859-1 by default, so I had to declare it an UTF-8 document! As I learned, "there's an issue with DOMDocument and loadHTML garbling UTF-8 content". This answer helped me: http://stackoverflow.com/a/8218649/2748220 – John Smith Sep 06 '13 at 06:36