I have a problem encoding this character with json_encode:

http://www.fileformat.info/info/unicode/char/92/index.htm

At first it gave me the error JSON_ERROR_UTF8, which is

'Malformed UTF-8 characters, possibly incorrectly encoded'

So I tried calling utf8_encode() before json_encode.

Now it returns '\u0092'.

Then I found this function:

 function jsonRemoveUnicodeSequences($struct) {
        return preg_replace("/\\\\u([a-f0-9]{4})/e", "iconv('UCS-4LE','UTF-8',pack('V', hexdec('U$1')))", json_encode($struct));
    }

The character shows up, but together with an extra character:

Â’

I also tried htmlentities followed by html_entity_decode, with no result.

  • What is your input encoding? Can you convert to utf8 before `json_encode`? – Halcyon May 12 '15 at 16:26
  • @Halcyon my input is an object. I use this function to UTF-8 encode it: `function utf8ize($mixed) { if (is_array($mixed)) { foreach ($mixed as $key => $value) { $mixed[$key] = utf8ize($value); } } else if (is_object($mixed)) { foreach ($mixed as $key => $value) { $mixed->$key = utf8ize($value); } } else if (is_string($mixed)) { return utf8_encode($mixed); } return $mixed; }` – mohamed amine hamza May 12 '15 at 16:29
  • why not simply `json_encode(iconv('UCS-4LE','UTF-8', $text))`? – Deadooshka May 12 '15 at 17:52
  • It raises the error 'Detected an incomplete multibyte character in input string', which led me to this question: http://stackoverflow.com/questions/26092388/iconv-detected-an-incomplete-multibyte-character-in-input-string which has the function I have been looking for – mohamed amine hamza May 13 '15 at 01:03
  • I found a helpful function here: http://stackoverflow.com/a/29667430/3479609 – mohamed amine hamza May 13 '15 at 01:07
  • hm are you sure that the `’` is what you think it is? Just copy-pasting what you typed above it's a different UTF8 entity than `\u0092` http://hexutf8.com/?q=c382e28099c292 – jar Sep 08 '16 at 14:48

2 Answers

json_encode() requires input that is

  • null
  • integer, float, boolean
  • string encoded as UTF-8
  • objects implementing JsonSerializable
  • arrays of JSON-encodable values
  • stdClass instances whose properties are JSON-encodable values

So, if you have a string, you must first transcode it to UTF-8. The correct tool for that is the iconv library, but you need to know which encoding the string currently has in order to correctly transcode it.
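As a minimal sketch of that step, assuming the source encoding is known to be Windows-1252 (an assumption for illustration; substitute whatever encoding your data really has), transcoding before encoding could look like this:

```php
<?php
// Assumption: the input bytes are Windows-1252. Replace with the real source encoding.
$raw = "It\x92s here"; // 0x92 is the Windows-1252 right single quotation mark

// Transcode to UTF-8; iconv() returns false if the input is invalid for that encoding.
$utf8 = iconv('Windows-1252', 'UTF-8', $raw);
if ($utf8 === false) {
    throw new RuntimeException('Input could not be converted from Windows-1252');
}

// json_encode() returns false on failure, so check it rather than using the result blindly.
$json = json_encode($utf8);
if ($json === false) {
    throw new RuntimeException(json_last_error_msg());
}

echo $json, "\n";                                       // non-ASCII is escaped as \uXXXX by default
echo json_encode($utf8, JSON_UNESCAPED_UNICODE), "\n";  // keeps the character itself in the output
```

Checking both return values matters here, because a wrong guess about the source encoding tends to show up either as a failed conversion or as garbled text later on.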

Your approach to recursively transcode arrays or objects should work, but I'd strongly suggest not using anything but UTF-8 internally. If you have an interface where you have to accept different encodings, validate and reject immediately and use UTF-8 onwards. Similarly, when replying, keep UTF-8 until the last possible point where you can still signal encoding problems.
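One way to "validate and reject immediately" at the boundary might look like the sketch below. The helper name `requireUtf8` is hypothetical, and the code assumes PHP 7 or later with the mbstring extension available:

```php
<?php
// Hypothetical helper: accept a string only if it is already valid UTF-8.
function requireUtf8(string $input): string
{
    if (!mb_check_encoding($input, 'UTF-8')) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $input;
}

// Valid UTF-8 passes through unchanged.
$ok = requireUtf8("It\u{2019}s here");

// A lone 0x92 byte is not valid UTF-8, so this input is rejected at the boundary.
try {
    requireUtf8("It\x92s here");
    $rejected = false;
} catch (InvalidArgumentException $e) {
    $rejected = true;
}
```

Rejecting early like this keeps everything behind the interface in a single known encoding, so json_encode() never sees anything else.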

Ulrich Eckhardt
  • The weird thing is that it's stored in the database as utf8_general_ci – mohamed amine hamza May 13 '15 at 00:59
  • I have UTF-8 stored in a MariaDB with utf8 collation, too. This works and you don't have to do anything to make it work for JSON either. There must be something else that you are doing. Sit down and create a minimal example, starting with creating the DB table and finally writing the content to it. Anything else is just guessing. Also, your question is not completely clear as to what you see (quote the exact content, don't paraphrase!) and what you expected to see instead. – Ulrich Eckhardt May 13 '15 at 05:10

If you look at the link you included to the character U+0092, it is a control character, and it is also known as PRIVATE USE TWO. Its existence in your string means that your string is almost certainly not a UTF-8 string. Instead, it is probably a Windows-specific encoding, likely Windows-1252 if your text is English, in which 0x92 is a "smart quote" apostrophe, also known as a right single quotation mark. The Unicode equivalent of this character is U+2019.

Thus your data source is not giving you UTF-8 text. Either you can fix the source data to be UTF-8 encoded, or you can convert the text you receive. For example, the output of

echo iconv('Windows-1252','UTF-8', "\x92")

is

’

which is probably what you want. However, you need to make sure that all of your input uses the same encoding. If some of your data is UTF-8 and some is Windows-1252, the above iconv call will properly handle Windows-1252 encoded apostrophes, but it will convert UTF-8 encoded apostrophes to

â€™
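If you cannot fix the source and really do receive a mix of both, one possible workaround is to convert only the strings that are not already valid UTF-8. This is a heuristic sketch, not a guarantee: the helper name `toUtf8` is hypothetical, it assumes anything that fails the UTF-8 check is Windows-1252, and it requires PHP 7 or later with mbstring:

```php
<?php
// Hypothetical helper: leave valid UTF-8 alone, treat everything else as Windows-1252.
function toUtf8(string $s): string
{
    if (mb_check_encoding($s, 'UTF-8')) {
        return $s; // already UTF-8, converting again would produce mojibake
    }
    $converted = iconv('Windows-1252', 'UTF-8', $s);
    if ($converted === false) {
        throw new UnexpectedValueException('Cannot convert input from Windows-1252');
    }
    return $converted;
}

$fromLegacy = toUtf8("It\x92s");      // Windows-1252 bytes
$fromUtf8   = toUtf8("It\u{2019}s");  // already UTF-8

var_dump($fromLegacy === $fromUtf8);  // both should now hold the same UTF-8 string
```

A Windows-1252 string can, in rare cases, also happen to be valid UTF-8, so fixing the encoding at the source remains the more reliable option.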
Lithis