
I'm trying to load a UTF-8 JSON file from disk using Node.js (0.10.29) on Windows 8.1. This is the code that runs:

var http = require('http');
var utils = require('util');
var path = require('path');
var fs = require('fs');

var myconfig;
fs.readFile('./myconfig.json', 'utf8', function (err, data) {
    if (err) {
        console.log("ERROR: Configuration load - " + err);
        throw err;
    } else {
        try {
            myconfig = JSON.parse(data);
            console.log("Configuration loaded successfully");
        }
        catch (ex) {
            console.log("ERROR: Configuration parse - " + ex);
        }
    }
});

I get the following error when I run this:

SyntaxError: Unexpected token ´╗┐
    at Object.parse (native)
    ...

Now, when I change the file encoding (using Notepad++) to ANSI, it works without a problem.

Any ideas why this is the case? Whilst development is being done on Windows, the final solution will be deployed to a variety of non-Windows servers; I'm worried that I'll run into issues on the server end if I deploy an ANSI file to Linux, for example.

According to my searches here and via Google the code should work on Windows as I am specifically telling it to expect a UTF-8 file.

Sample config I am reading:

{
    "ListenIP4": "10.10.1.1",
    "ListenPort": 8080
}
Dominik
  • I have had weird things happen with reading files in node... but sometimes doing a (data+'') will make the string behave more correctly. Also if it is valid json you could always make it a .js file and do module.exports = { /* data here */ }; then require it, though I don't think that will help with this problem. – Catalyst Jun 22 '14 at 23:34

3 Answers


Per "fs.readFileSync(filename, 'utf8') doesn't strip BOM markers #1918", fs.readFile is working as designed: the BOM is not stripped from the start of a UTF-8 file if one exists. It is at the discretion of the developer to handle this.

Possible workarounds:

What you are getting is the byte order mark (BOM) at the start of the UTF-8 file. When JSON.parse sees this, it throws a syntax error (read: "unexpected character" error). You must strip the byte order mark from the string before passing it to JSON.parse:

fs.readFile('./myconfig.json', 'utf8', function (err, data) {
    // With the 'utf8' encoding option, data is already a string (not a
    // Buffer), and the BOM arrives as a single leading U+FEFF character.
    myconfig = JSON.parse(data.replace(/^\uFEFF/, ''));
});
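For illustration, here is a minimal round trip showing the failure and the fix (the config value is borrowed from the question's sample file):

```javascript
// A BOM'd UTF-8 file read with the 'utf8' option yields a string that
// starts with U+FEFF. JSON.parse rejects it; stripping the character fixes it.
const raw = '\uFEFF{ "ListenPort": 8080 }';

let parseFailed = false;
try {
  JSON.parse(raw); // throws SyntaxError: U+FEFF is not legal JSON whitespace
} catch (e) {
  parseFailed = true;
}

const config = JSON.parse(raw.replace(/^\uFEFF/, ''));
console.log(parseFailed, config.ListenPort); // true 8080
```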
zamnuts
    Thanks, hopefully devs new to node.js find this page. As you might imagine going to that level of searching node issues seems a bit much. Then again, all Youtube vids in regards to node.js dev do show Macs being used. The fact that I hit this problem with my first ever node.js project shows me that it may be hit by others. – Dominik Jun 24 '14 at 01:38
  • IMHO the BOM is meta data and not part of the file. Developers (and Users) want the text in a file and that excludes the BOM. – Marc Nov 29 '16 at 08:24
  • @Marc Node.js reads the file as bytes (Buffer), which is more low-level than string (text). Strings require an encoding, and if Node.js assumed the file being read as such, we wouldn't be able to read blobs easily. Do not make the assumption that all files are text. For this reason, decoding of the file is left to the program. Furthermore, knowing the encoding is required to understand the abstracted text stream, which could be 8-bit ASCII, UTF-7, UTF-8, UTF-16, ISO-8859-1, etc. Just because the BOM isn't there in a UTF-8 file, doesn't mean there isn't an endianness: by default UTF-8 is BE. – zamnuts Nov 30 '16 at 16:32
    My point is node need not be "dumber" than notepad.exe which can open any file and display it correctly. True there are issues but when a BOM is present it's clear and unique. I take issue with IT systems which say: if we can't do a 100% job we'll do a 0%. No! Best effort and fail gracefully. – Marc Dec 01 '16 at 18:56
  • _"if Node.js assumed the file being read as such, we wouldn't be able to read blobs easily"_ I completely disagree. If you're specifying utf8 encoding and there's a BOM, remove the BOM. That's not assuming anything. You've already said "this is utf8". If it's not safe to assume a BOM is a BOM when you already know the file is utf8 then there's literally no point to the BOM. Either you see the BOM and you know it's a BOM and you can safely remove it, or you don't see the BOM. There's no mid-ground of seeing the BOM but not being able to safely remove it. – Clonkex Jul 06 '22 at 23:14
  • BOM is not applicable to utf8, see https://unicode.org/faq/utf_bom.html#gen6. Node.js also supports utf16le, but not utf16be (unless w/ `small-icu` option). When reading the file, we're making an assumption that it is utf8, but if we encounter the BOM we may consider whether the file is utf16 or utf32, or just utf8 with a superfluous BOM. See https://nodejs.org/docs/latest-v16.x/api/util.html#new-textdecoderencoding-options, https://superuser.com/a/1553672/157958, https://superuser.com/a/1648808/157958, and https://github.com/nodejs/node/search?q=bom&type=issues. – zamnuts Jul 10 '22 at 04:39

To get this to work, I had to change the encoding from "UTF-8" to "UTF-8 without BOM" using Notepad++ (I assume any decent text editor, though not Notepad, can choose this encoding type).

This solution meant that the deployment guys could deploy to Unix without a hassle, and I could develop without errors during the reading of the file.

In terms of reading the file, with various other encoding options I sometimes got a question mark prepended to the start of the file contents. Naturally, with a question mark or stray ANSI characters prepended, JSON.parse fails.

Hope this helps someone!

Dominik

New answer
As I had the same problem with several different formats, I went ahead and made an npm package that tries to read text files and parse them as text, no matter the original format (as the original question was about reading a .json file, it fits perfectly). Files without a BOM, or with an unknown BOM, are handled as ASCII/latin1.

https://www.npmjs.com/package/textfilereader

So change the code to

var http = require('http');
var utils = require('util');
var path = require('path');
var fs = require('textfilereader');

var myconfig;
fs.readFile('./myconfig.json', 'utf8', function (err, data) {
    if (err) {
        console.log("ERROR: Configuration load - " + err);
        throw err;
    } else {
        try {
            myconfig = JSON.parse(data);
            console.log("Configuration loaded successfully");
        }
        catch (ex) {
            console.log("ERROR: Configuration parse - " + ex);
        }
    }
});

Old answer

I ran into this problem today and created a function to take care of it. It should have a very small footprint, and I assume it's better than the accepted replace solution.

function removeBom(input) {
  // BOM byte sequences, per https://en.wikipedia.org/wiki/Byte_order_mark.
  // Note: this must run on the raw Buffer. If the file is read with the
  // 'utf8' option, Node has already decoded EF BB BF into a single U+FEFF
  // character, so multi-byte signatures can never match the decoded string.
  const boms = [
    [0x00, 0x00, 0xfe, 0xff], // UTF-32 (BE)
    [0xff, 0xfe, 0x00, 0x00], // UTF-32 (LE), check before UTF-16 (LE)
    [0xef, 0xbb, 0xbf],       // UTF-8
    [0xfe, 0xff],             // UTF-16 (BE)
    [0xff, 0xfe],             // UTF-16 (LE)
    // Other signatures (UTF-7, UTF-1, UTF-EBCDIC, SCSU, BOCU-1, GB-18030)
    // can be added to this list in the same way.
  ];
  for (const bom of boms) {
    if (input.length >= bom.length && bom.every((b, i) => input[i] === b)) {
      return input.slice(bom.length);
    }
  }
  return input;
}

const fileBuffer = removeBom(fs.readFileSync(filePath)); // no encoding: returns a Buffer
Griffin