
I’ve done some Google searches, but I get results related to encoding strings or files.

Can I write my Node.js JavaScript source code in UTF-8? Can I use non-ASCII characters in comments, strings, or as variable names?

ECMA-262 seems to require UTF-16 encoding, but Node.js won’t run a UTF-16 encoded .js file. It will, however, run UTF-8 source and correctly interpret non-ASCII characters.

So is this by design or by “accident”? Is it specified somewhere that UTF-8 source code is supported?

Nate
  • 1
    I've never given this a second thought, but I constantly use UTF-8 for everything I do and have never had a problem. – Alex Turpin Apr 12 '12 at 14:05
  • 1
    I expect that it's not so much a Node.js thing, but a V8 thing. – Pointy Apr 12 '12 at 14:07
  • 1
    I was hoping someone could point to, say, Node.js or V8 documentation that says what source encodings are allowed. (Python example: http://www.python.org/dev/peps/pep-0263/). Yeah, I can and did futz around and see what works, but I want a more concrete answer. – Nate Apr 12 '12 at 15:12
  • You're linking to a very old version of the spec (the 3rd revision is from 1999; we just hit the 6th revision last June). The current version is [here](http://www.ecma-international.org/ecma-262/6.0/index.html#sec-source-text). The requirement is "Unicode" (with ASCII, by convention, being a subset of Unicode, since the first 128 Unicode code points are the same as those ASCII specifies) – Mike 'Pomax' Kamermans Sep 11 '15 at 17:07
  • Hi @Nate, it seems some years have passed since you asked this question. I'm looking for something like the Python example you mentioned in your comment. Have you found a concrete answer in the meantime? – Daniele Ricci Nov 11 '21 at 12:42
  • The answer from @Mike'Pomax'Kamermans is correct. The current version now is [here](https://262.ecma-international.org/12.0/#sec-ecmascript-language-source-code). It says the source text must be Unicode “regardless of the external source text encoding”. That means the file encoding is a Node implementation detail. I can’t find the requirement in the Node.js documentation, but of course UTF-8 is the de facto standard encoding now (even more so than it was in 2012 when I asked the question). We use UTF-8 for all Node.js code at my company and it has worked well. – Nate Nov 12 '21 at 17:48
  • Thank you @Nate, we use UTF-8 too. My goal now is to avoid bidirectional Unicode text in order to prevent _trojan source_ security holes. – Daniele Ricci Nov 12 '21 at 17:55
  • ECMA-262: "The actual encodings used to store and interchange ECMAScript source text is not relevant to this specification" (https://262.ecma-international.org/12.0/#sec-ecmascript-language-source-code). What you read about UTF-16 is probably about the internal representation of strings in memory, which is unrelated to the encoding of the source file. – leonbloy May 04 '23 at 17:57

2 Answers


Reference: http://mathiasbynens.be/notes/javascript-identifiers

UTF-8 characters are valid JavaScript variable names. Go ahead and encode your source as UTF-8.
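As a rough illustration of the point above, here is a minimal sketch of non-ASCII identifiers. It assumes the file is saved as UTF-8 and run with Node.js; all names in it (`π`, `планета`, `площадьКруга`, `viaEscape`) are made up for this example and do not come from the linked article.

```javascript
// Non-ASCII identifiers (hypothetical names, for illustration only).
const π = Math.PI;
const планета = 'Зямля';

// Function and parameter names may use non-ASCII letters as well.
function площадьКруга(радиус) {
  return π * радиус * радиус;
}

// A Unicode escape inside an identifier refers to the same binding
// as the literal character: \u03c0 is the identifier π.
const viaEscape = \u03c0;

console.log(площадьКруга(1) === Math.PI); // true
console.log(viaEscape === π);             // true
console.log(планета);
```

Whether such names are a good idea in a shared code base is a separate question; the sketch only shows that the engine accepts them.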

000
  • 3
    Unicode characters and UTF-8 encoding are different things. The standard actually seems to require UTF-16, not UTF-8 (but that doesn’t seem to be true in practice). It’s nice to have confirmation Unicode characters are valid variable names though. – Nate Apr 12 '12 at 15:14
  • 8
    Although available, I can't recommend doing things like `var Hͫ̆̒̐ͣ̊̄ͯ͗͏̵̗̻̰̠̬͝ͅE̴̷̬͎̱̘͇͍̾ͦ͊͒͊̓̓̐_̫̠̱̩̭̤͈̑̎̋ͮͩ̒͑̾͋͘Ç̳͕̯̭̱̲̣̠̜͋̍O̴̦̗̯̹̼ͭ̐ͨ̊̈͘͠M̶̝̠̭̭̤̻͓͑̓̊ͣͤ̎͟͠E̢̞̮̹͍̞̳̣ͣͪ͐̈T̡̯̳̭̜̠͕͌̈́̽̿ͤ̿̅̑Ḧ̱̱̺̰̳̹̘̰́̏ͪ̂̽͂̀͠ = 'Zalgo';` – 000 Apr 12 '12 at 15:39
  • 4
    The standard says that the native text processing model of JavaScript is based on UTF-16 code units. That doesn't specify what byte-encoding is used to convert a source file to those units. – bobince Apr 14 '12 at 12:36

I can't find documentation that says that Node treats files as encoded in UTF-8, but it seems that way experimentally:

/* Check in your editor that this JavaScript file was saved in UTF-8 */
var nonEscaped = "Планета_Зямля";
var escaped = "\u041f\u043b\u0430\u043d\u0435\u0442\u0430\u005f\u0417\u044f\u043c\u043b\u044f";
if (nonEscaped === escaped) {
  console.log("They match");
}

The above example prints `They match`.
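The same experiment can be made independent of editor settings by writing the bytes explicitly. The following is only a sketch under some assumptions: it runs as a CommonJS script, the helper `loads` and the temporary file names are hypothetical, and it observes behavior rather than relying on any documented guarantee.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// The expected string is written with escapes, so this script itself is pure ASCII.
const expected = '\u041f\u043b\u0430\u043d\u0435\u0442\u0430\u005f\u0417\u044f\u043c\u043b\u044f';
const source = 'module.exports = "' + expected + '";\n';
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'enc-test-'));

// Hypothetical helper: write the module in the given encoding, then try to load it.
function loads(encoding) {
  const file = path.join(dir, 'sample-' + encoding + '.js');
  fs.writeFileSync(file, Buffer.from(source, encoding));
  try {
    return require(file) === expected;
  } catch (err) {
    return false; // the file could not be parsed or loaded
  }
}

const utf8Works = loads('utf8');
const utf16Works = loads('utf16le');
console.log({ utf8Works, utf16Works });
```

If Node reads source files as UTF-8, `utf8Works` should be `true`; the UTF-16 result shows whether that encoding is accepted on your version.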

Non-BMP note:

Note that UTF-8 can encode non-BMP code points (U+10000 and above), but JavaScript complicates this case: its strings are sequences of UTF-16 code units, so each such code point is stored as a surrogate pair. This is part of the language:

/* Check in your editor that this JavaScript file was saved in UTF-8 */
var nonEscaped = "💩"; // U+1F4A9
var escaped1 = "\ud83d\udca9";
if (nonEscaped === escaped1) {
  console.log("They match");
}
/* Newer implementations support this syntax: */
var escaped2 = "\u{1f4a9}";
if (nonEscaped === escaped2) {
  console.log("The second string matches");
}

This prints `They match` and `The second string matches`.
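To make the surrogate-pair behavior more concrete, here is a small sketch that inspects the same character in several ways. It uses only standard string methods and Node's `Buffer.byteLength`; the variable names `pile` and `info` are just for illustration.

```javascript
// U+1F4A9 written with an escape, so the result does not depend on the file encoding.
const pile = '\u{1f4a9}';

const info = {
  codeUnits: pile.length,                      // counts UTF-16 code units
  codePoints: Array.from(pile).length,         // string iteration goes by code point
  firstUnit: pile.charCodeAt(0).toString(16),  // high surrogate
  secondUnit: pile.charCodeAt(1).toString(16), // low surrogate
  codePoint: pile.codePointAt(0).toString(16), // the full code point
  utf8Bytes: Buffer.byteLength(pile, 'utf8'),  // size when encoded as UTF-8
};

console.log(info);
// { codeUnits: 2, codePoints: 1, firstUnit: 'd83d', secondUnit: 'dca9',
//   codePoint: '1f4a9', utf8Bytes: 4 }
```

So one code point occupies two code units in the string but four bytes in a UTF-8 source file, which is why `length` can be surprising for such characters.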

Flimm