
According to this question, JSON is automatically written using surrogate pairs.

However, this is not my experience.

Using Node 6.9.2 and the following code, my output file still contains characters that have not been escaped as surrogate pairs.

const fs = require('fs')

// fs.readFile returns undefined, so there is no value worth capturing here
fs.readFile('raw.json', 'utf8', (err, data) => {
    if (err) {
        throw err
    }

    // data is the raw file text (a string), so this produces a JSON string literal
    data = JSON.stringify(data)

    fs.writeFile('final.json', data, 'utf8', (err) => {
        if (err) {
            throw err
        }
        console.log('done')
    })
})

In my editor, which evidently has good Unicode support and a font with glyphs for these characters, the contents of the file raw.json include characters such as "題".

That character still appears in final.json (no change is made).

Additionally, I tried switching the encoding from utf8 to utf16le for the file being written, but nothing changed.

Is there a way to force using surrogate pairs?

  • No. UTF-16 requires using surrogate pairs only when needed for the defined range of Unicode codepoints. Perhaps there is a terminology problem here. You are talking about the bytes in the file, right? Bytes look like 0x4C 0x98 or similar, not like "題". – Tom Blodget May 20 '17 at 01:59
  • Sorry if I was unclear. I mean when I open the file in a simple text editor, I see characters like that. What I want to see is something like `\uff00f`. – Startec May 20 '17 at 07:48
  • I'm glad you got a good answer. The JSON syntax for an escaped UTF-16 code unit in a string literal is independent of the concept of a surrogate pair. Escapes are completely optional and can be done on a case by case basis. Now, given that JSON files and streams are [supposed](https://tools.ietf.org/html/rfc7159#page-9) to be encoded as UTF-8, UTF-16 or UTF-32, one might wonder which code units you want to escape and why you want to escape at all? [Rhetorical question; no need to answer.] – Tom Blodget May 20 '17 at 12:26

1 Answer


The quoted question is misleading if one concludes that JSON.stringify will convert Unicode characters outside the Basic Multilingual Plane into sequences of \u-escaped surrogate pair values. This answer explains that JSON.stringify only needs to escape the backslash (\), double quote (") and control characters.
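A quick check in Node confirms those escaping rules (the particular characters here are just illustrative):

console.log(JSON.stringify('題'));         // "題"  — passed through, not "\u984c"
console.log(JSON.stringify('a"b\\c\nd'));  // "a\"b\\c\nd" — quote, backslash and newline escaped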

Hence if the input data contains a character occupying more than one octet (such as the '題' used in the example) it will be written to the output file as that character. If the file is successfully written and then read back using the same encoding, the character should round-trip unchanged and be the character you see.
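To make "more than one octet" concrete, Buffer.byteLength (a standard Node API) reports the encoded size of a string:

// '題' (U+984C) occupies three octets in UTF-8 and two in UTF-16
console.log(Buffer.byteLength('題', 'utf8'));    // 3
console.log(Buffer.byteLength('題', 'utf16le')); // 2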

If the goal is to convert JSON text into ASCII, using \u-escaped characters for non-ASCII values and surrogate pairs for characters outside the BMP, then the JSON-formatted string can be processed using simple character inspection (JSON.stringify has already escaped the quote, backslash and control characters):

var jsonComponent = '"2®π≤題😀"'; // for example — the last character is an emoji outside the BMP

function jsonToAscii( jsonText) {
    var s = "";

    for( var i = 0; i < jsonText.length; ++i) {
        var c = jsonText[ i];            // one UTF-16 code unit at a time
        if( c >= '\x7F') {               // escape anything outside printable ASCII
            c = c.charCodeAt(0).toString(16);
            switch( c.length) {          // zero-pad the hex value to 4 digits
              case 2: c = "\\u00" + c; break;
              case 3: c = "\\u0" + c; break;
              default: c = "\\u" + c; break;
            }
        }
        s += c;
    }
    return s;
}

console.log( jsonToAscii( jsonComponent)); // "2\u00ae\u03c0\u2264\u984c\ud83d\ude00"

This makes use of the fact that JavaScript strings are already UTF-16 (so they contain surrogate pairs), but are accessed as successive UCS-2-style 16-bit code units by array index lookup and the .charCodeAt method. Notice that '題' is not outside the BMP and only requires two octets in UTF-16, while the emoji is outside plane 0 and does require 4 octets (two code units) in UTF-16.
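A small check of that claim (using '😀' here; any astral-plane character behaves the same way):

var emoji = '😀';                               // U+1F600, outside the BMP
console.log(emoji.length);                      // 2 — two UTF-16 code units
console.log(emoji.charCodeAt(0).toString(16));  // d83d (high surrogate)
console.log(emoji.charCodeAt(1).toString(16));  // de00 (low surrogate)

// and the escaped output above is still valid JSON, so it parses back intact
console.log(JSON.parse(jsonToAscii(jsonComponent))); // 2®π≤題😀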

If that is not the goal, there is at least a small possibility there is no problem.

– traktor