0

I am trying to use data from an API. I am using request for the API access, but have also tried axios.

const request = require('request')
request('https://remoteok.io/api', function (error, response, body) {
  const data = JSON.parse(body)
  console.log(data)
})

When accessing the website remoteok.io/api in a browser, I can see sequences like \u00e2\u0080\u0099. This sequence should be a right single quotation mark (’), but when I log it to the console in JavaScript or use express to render res.json(body), I get the characters â€™ instead.

How can I fix this encoding issue? Shouldn't JSON always just be plain UTF-8?

UPDATE: Here is a simple glitch project that shows the behavior.

janniks
  • 2,942
  • 4
  • 23
  • 36
  • What encoding does your console use? – Phil Jul 30 '19 at 04:50
  • Also, what is the actual problem? What do you expect to see? – Phil Jul 30 '19 at 04:52
  • Possible duplicate of [JSON and escaping characters](https://stackoverflow.com/questions/4901133/json-and-escaping-characters) – Phil Jul 30 '19 at 04:53
  • I have updated the question to be more precise... When I rerender the received JSON using express, it differs. And so far I couldn't find out what the issue is. If it's a JSON parsing issue in `request` or `axios` or both; or whether it's an issue with the JSON rendering of `express`; or if there's something wrong with the actual content of the API - but a simple rerender should not convert the encoding/characters... – janniks Jul 30 '19 at 05:09
  • Sounds like [mojibake](https://en.wikipedia.org/wiki/Mojibake) at the source. `\u00e2\u0080\u0099` does not directly resolve to a backtick in any way. – deceze Jul 30 '19 at 07:22
  • The author of this particular JSON confused UTF8 byte stream encoding (a level below JSON) with the JSON syntax itself, which allows encoding Unicode Code Points. – trincot Jul 30 '19 at 07:54

2 Answers

1

The problem is in the source data: the JSON sequence "\u00e2\u0080\u0099" does not represent a right closing quotation mark. There are three Unicode code points here; the first represents "â", while the other two are control characters.

You can verify this in a dev console, or by running the snippet below:

console.log(JSON.parse('"\u00e2\u0080\u0099"'));

Apparently the author of that JSON mixed up two things:

  • JSON is encoded in UTF
  • A \u notation represents a Unicode Code Point

The first means that the file or stream encoding the JSON text into bytes should be UTF encoded (preferably UTF-8). The second has nothing to do with that: JSON syntax allows specifying 16-bit Unicode code points with the \u notation. It is not intended that a sequence1 of \u escapes spell out the bytes of a UTF-8 encoding. One should not be concerned with the lower-level UTF-8 byte stream encoding when writing JSON text.

1 Surrogate pairs deserve a mention here, but they are unrelated to UTF-8; they concern how Unicode code points beyond the 16-bit range are encoded in JSON.
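For completeness, a surrogate-pair example (unrelated to this API's bug):

console.log(JSON.parse('"\\ud83d\\ude00"')); // code points above U+FFFF, such as U+1F600 (😀), are written in JSON as two \u escapes that together form one character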

So although the right closing quotation mark has the UTF-8 byte sequence E2 80 99, those three bytes are not to be encoded with a \u notation each.
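A small Node.js sketch (using Buffer) makes the producer's mistake visible: interpreting the three parsed code points as Latin-1 bytes yields exactly the UTF-8 sequence E2 80 99.

// Parse the malformed escapes: this yields three code points
// (â, U+0080, U+0099), not one right single quotation mark.
const wrong = JSON.parse('"\\u00e2\\u0080\\u0099"');

// Each code point is below U+0100, so Latin-1 maps them 1:1 to bytes…
const bytes = Buffer.from(wrong, 'latin1');
console.log(bytes); // <Buffer e2 80 99>

// …and those bytes are the UTF-8 encoding of ’ (U+2019).
console.log(bytes.toString('utf8')); // ’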

The right closing quotation mark is Unicode code point U+2019, written \u2019 in JSON. So the source JSON should either have that escape, or just have the character ’ literally (which will indeed be a UTF-8 sequence in the byte stream, but that is a level below JSON).

See those two possibilities:

console.log(JSON.parse('"’"'));
console.log(JSON.parse('"\u2019"'));

And now?

I would advise you to contact the service provider of this particular API. They have a bug in their JSON producing service.

Whatever you do, do not try to fix this in the client that consumes the service by recognising such malformed sequences and replacing them as if those characters represented UTF-8 bytes. Such a fix will be hard to maintain and may even hit false positives.
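To illustrate the false-positive risk, here is a hypothetical client-side "fix" (not a recommendation): re-decoding every string from Latin-1 to UTF-8 repairs this particular mojibake, but silently corrupts legitimate text that happens to contain non-ASCII characters.

// Hypothetical repair: reinterpret a string's code points as Latin-1
// bytes and decode those bytes as UTF-8 (Node.js Buffer).
const naiveFix = s => Buffer.from(s, 'latin1').toString('utf8');

console.log(naiveFix('â\u0080\u0099')); // ’ — repairs the malformed sequence…
console.log(naiveFix('Señor'));         // …but corrupts valid text: ñ alone is not valid UTF-8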

trincot
  • 317,000
  • 35
  • 244
  • 286
0

I think this is not an error; you can use a browser extension such as JSON Viewer to view the JSON in the browser.

vantrong291
  • 505
  • 4
  • 6