I'm passing some strings to json_encode. Sometimes they contain binary data, which causes the encoding to fail with the error code JSON_ERROR_UTF8. Running the strings through utf8_encode gets around this error. However, ✓ (a Unicode check mark) then gets encoded as \u00e2\u009c\u0093, which, when interpreted by JavaScript and rendered in the browser, looks like â.

How can I fix this? Is there another encoding I can use?


echo json_encode(utf8_encode('✓')); // "\u00e2\u009c\u0093"

Now press F12 and paste that into your JavaScript console (quotes included). It should output â.


Please note that

echo json_encode('✓'); // "\u2713"

Works as intended. The issue is that sometimes the string will contain binary data which json_encode can't handle, so I need to sanitize every string without breaking the strings it can handle.


More examples:

json_encode(chr(200));               // false (bad)
json_encode(utf8_encode(chr(200)));  // "\u00c8" (good)
json_encode('✓');                    // "\u2713" (good)
json_encode(utf8_encode('✓'));       // "\u00e2\u009c\u0093" (bad)

So you see, encoding it works well for some strings and breaks others.
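
To make the failure mode concrete, here is a minimal sketch of what happens to the bytes. It assumes the mbstring extension is available and uses mb_convert_encoding from ISO-8859-1 as a stand-in for utf8_encode (which is deprecated as of PHP 8.2); the variable names are illustrative only.

```php
<?php
// "✓" (U+2713) written out as its three UTF-8 bytes.
$check = "\xE2\x9C\x93";

// Same effect as utf8_encode(): every input byte is treated as one
// ISO-8859-1 character and re-encoded as UTF-8.
$double = mb_convert_encoding($check, 'UTF-8', 'ISO-8859-1');

echo bin2hex($check), "\n";       // e29c93
echo bin2hex($double), "\n";      // c3a2c29cc293
echo json_encode($check), "\n";   // "\u2713"
echo json_encode($double), "\n";  // "\u00e2\u009c\u0093"
```

The three bytes of the check mark become three separate characters (U+00E2, U+009C, U+0093), which is why the browser shows â followed by two invisible control characters.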

This is strictly for logging. I don't care if the binary data comes out weird, I just don't want it to mess with valid strings.

mpen
  • Can you show your example PHP and JS code? – hek2mgl Aug 21 '14 at 19:15
  • Maybe the problem lies in the document charset. Did you try adding `` in the head of the HTML document? – collimarco Aug 21 '14 at 19:20
  • @hek2mgl I pretty much gave it to you, but nevertheless, I updated the question. – mpen Aug 21 '14 at 19:20
  • @collimarco Yes, I did. I believe the correct encoding should be `"\u2713"` -- that renders fine. – mpen Aug 21 '14 at 19:21
  • This question is unanswerable. Encoding arbitrary binary data is one thing, keeping UTF-8 characters intact is something completely separate. What's to stop `0xe29c93` from being interpreted as ✓ when it shows up in your binary data? – amphetamachine Aug 21 '14 at 19:27
  • `chr(200)` isn't a valid unicode char – hek2mgl Aug 21 '14 at 19:31

2 Answers

Running strings through this function

function _utf8($str) {
    if(!mb_check_encoding($str, 'UTF-8')) {
        return utf8_encode($str);
    }
    return $str;
}

(taken and modified from here)

Seems to give the results I'm after.

Checkmarks are left alone, but chr(200) and other weirdness is encoded:

json_encode(_utf8('✓'));      // "\u2713"
json_encode(_utf8(chr(200))); // "\u00c8"
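
Since utf8_encode is deprecated as of PHP 8.2, the same idea can be sketched without it. This is only a sketch under assumptions: the name to_loggable_utf8 is hypothetical, the mbstring extension is assumed, and any string that is not valid UTF-8 is simply assumed to be ISO-8859-1, exactly as utf8_encode does.

```php
<?php
// Hypothetical helper name; same idea as _utf8() above.
function to_loggable_utf8(string $str): string
{
    // Valid UTF-8 passes through untouched.
    if (mb_check_encoding($str, 'UTF-8')) {
        return $str;
    }
    // Anything else is reinterpreted byte by byte as ISO-8859-1.
    return mb_convert_encoding($str, 'UTF-8', 'ISO-8859-1');
}

echo json_encode(to_loggable_utf8("\xE2\x9C\x93")), "\n"; // "\u2713"
echo json_encode(to_loggable_utf8(chr(200))), "\n";       // "\u00c8"
```

Binary data that happens to form valid UTF-8 still passes through unchanged, which is acceptable here because the output is only used for logging.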
mpen

EDIT: This question is unanswerable. Encoding arbitrary binary data is one thing, keeping UTF-8 characters intact is something completely separate. What's to stop the byte sequence 0xe29c93 from being interpreted as ✓ when it shows up in your binary data?

According to the json_encode PHP reference page, you can use the following syntax to encode Unicode characters:

json_encode($data, JSON_UNESCAPED_UNICODE);

This should pass Unicode characters through unescaped.
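
Note that JSON_UNESCAPED_UNICODE only changes how valid characters are written; it does not make json_encode accept invalid bytes. A minimal sketch, assuming PHP 7.2 or later (where the JSON_INVALID_UTF8_SUBSTITUTE and JSON_INVALID_UTF8_IGNORE flags are available), compares the behavior; the variable names are illustrative only.

```php
<?php
$valid  = "\xE2\x9C\x93";   // "✓" as UTF-8 bytes
$binary = "ok" . chr(200);  // a lone 0xC8 byte is not valid UTF-8

// Invalid UTF-8 still makes json_encode fail with this flag.
var_dump(json_encode($binary, JSON_UNESCAPED_UNICODE));        // bool(false)

// Valid strings are unaffected; invalid bytes become U+FFFD.
echo json_encode($valid, JSON_INVALID_UTF8_SUBSTITUTE), "\n";  // "\u2713"
echo json_encode($binary, JSON_INVALID_UTF8_SUBSTITUTE), "\n"; // "ok\ufffd"

// Alternatively, invalid bytes are dropped.
echo json_encode($binary, JSON_INVALID_UTF8_IGNORE), "\n";     // "ok"
```

Either flag keeps readable strings readable while preventing a false return value for binary input.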

amphetamachine
  • Tried it already. Doesn't work: `json_encode(chr(200),JSON_UNESCAPED_UNICODE)` yields false. – mpen Aug 21 '14 at 19:25
  • re: "What's to stop..." I don't actually care if that shows up in my binary data. I just need it to not break (return false) for data it can't handle. – mpen Aug 21 '14 at 19:32
  • @Mark Then transfer it into an encoding it will always be able to handle. For example, base64 encode it. – amphetamachine Aug 21 '14 at 19:34
  • That would make valid strings illegible. It's for logging. I want to be able to read the valid strings. I'll visually ignore any binary data. – mpen Aug 21 '14 at 19:36