
I am attempting to write a node app that reads in a set of files, splits them into lines, and puts the lines into an array. Pretty simple. It works on quite a few files except some SQL files that I am working with. For some reason I seem to be getting some kind of unicode output when I split the lines up. The app looks something like this:

var fs = require("fs");

// Read the file as UTF-8 and split it into lines.
var data = fs.readFileSync("test.sql", "utf8");
console.log(data);
var lines = data.split("\n");
console.log(lines);

The input file looks something like this:

use whatever
go

The output looks like this:

��use whatever
go

[ '��u\u0000s\u0000e\u0000 \u0000w\u0000h\u0000a\u0000t\u0000e\u0000v\u0000e\u0000r\u0000',
  '\u0000g\u0000o\u0000',
  '\u0000' ]

As you can see there is some kind of unrecognized character at the beginning of the file. After reading the data in and directly outputting it, it looks okay except for that character. However, if I then attempt to split it up into lines, I get all these Unicode-like escapes. Basically it's all the actual characters interleaved with "\u0000" characters.

I have no idea what's going on here but it appears to have something to do with the characters in the file itself. If I copy and paste the text of the file into another new file and run the app on the new file, it works fine. I assume that whatever is causing this issue is being stripped out during the copy and paste process.

hippietrail
d512

4 Answers


Your file is in UTF-16 Little Endian, not UTF-8.

var data = fs.readFileSync("test.sql", "utf16le"); // note: this does not strip the BOM; it stays as a leading "\uFEFF"

Unfortunately node.js only supports UTF-16 Little Endian / UTF-16LE (it's hard to be sure from the docs which one; there is a slight difference between them, namely that UTF-16LE does not use a BOM), so for big endian files, or to deal with the BOM, you have to use iconv or convert the file to UTF-8 some other way.

Example:

var Iconv  = require('iconv').Iconv,
    fs = require("fs");

// Read the raw bytes and let iconv convert them; unlike "UTF-16LE",
// the "UTF-16" converter reads the BOM to determine the byte order.
var buffer = fs.readFileSync("test.sql"),
    iconv = new Iconv( "UTF-16", "UTF-8");

var result = iconv.convert(buffer).toString("utf8");
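
If you know the file is little endian (as this one turned out to be) and want to avoid the native iconv dependency, a minimal sketch that reads it with "utf16le" and strips the decoded BOM by hand, since Node leaves it in place:

var fs = require("fs");

// "utf16le" decodes the bytes but keeps the BOM as a leading "\uFEFF".
var data = fs.readFileSync("test.sql", "utf16le");
if (data.charCodeAt(0) === 0xFEFF) {
    data = data.slice(1); // strip the byte order mark
}

var lines = data.split("\n");
console.log(lines);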
Esailija
  • Wow, you nailed it. Thank you. So just out of curiosity, how did you know this file is big endian UTF-16? Is there a way to detect that in node? I am processing several files and they are not all encoded the same way. – d512 Jan 18 '13 at 17:28
  • @user1334007 because of the nulls at even positions, if they were at odd positions it would have been little endian. Detecting encoding automatically requires some heuristics by analyzing the null positions to determine which UTF-16, and UTF-8 has very unique patterns. But most other encodings cannot be detected without just trying and seeing if the text comes out right. (A sketch of such a heuristic follows these comments.) – Esailija Jan 18 '13 at 17:31
  • FYI, I found something that looks promising for charset detection with node: https://github.com/mooz/node-icu-charset-detector. Haven't tried it yet, but if I get it working, I'll report back. – d512 Jan 18 '13 at 18:10
  • @user1334007 yeah, but note that it's impossible to detect encoding reliably. It's worth trying though if you have many files and/or cannot manually detect them – Esailija Jan 18 '13 at 18:15
  • 1
    Yeah, I tried it out and found that it was not of much help. In the end I rewrote the tool in .NET and it works much better. – d512 Jan 18 '13 at 23:28
  • The nulls are at odd positions, so it is really little endian (which is the more common form of UTF-16 as Windows prefers it), so there shouldn't be a problem. – bobince Jan 19 '13 at 00:05
  • @bobince yeah didn't notice the `u` after the bom :X will fix – Esailija Jan 19 '13 at 09:52
  • FWIW: Parsing a text file exported from an Excel VBA script as XlFileFormat.xlUnicodeText, it seems to be UTF-16 LE as well, so I only needed to specify the "utf16le" format in node (then split by "\r\n" for distinct lines). Thanks for this answer! – Tyler Dec 20 '13 at 21:43
  • It's very easy to convert both file encodings and line endings with Visual Studio Code. There are selectors at the bottom and you can quickly resave the file from there :) – jocull Oct 25 '20 at 01:27
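
For reference, here is a rough sketch of the kind of heuristic described in the comments above (the function name guessEncoding is just for illustration). It checks for a BOM and then looks at where the null bytes fall; as the comments note, this is a best guess for ASCII-heavy text, not reliable detection:

var fs = require("fs");

function guessEncoding(file) {
    var buf = fs.readFileSync(file);
    if (buf[0] === 0xFF && buf[1] === 0xFE) return "utf16le"; // FF FE BOM
    if (buf[0] === 0xFE && buf[1] === 0xFF) return "utf16be"; // FE FF BOM

    // In ASCII-heavy UTF-16 text, null bytes land on every other byte:
    // odd positions for little endian, even positions for big endian.
    var evenNulls = 0, oddNulls = 0;
    var limit = Math.min(buf.length, 512);
    for (var i = 0; i < limit; i++) {
        if (buf[i] !== 0) continue;
        if (i % 2 === 0) evenNulls++; else oddNulls++;
    }
    if (oddNulls > evenNulls) return "utf16le";
    if (evenNulls > oddNulls) return "utf16be"; // Node can't read this directly; use iconv
    return "utf8"; // no nulls seen; assume UTF-8
}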

Is this perhaps the BOM (Byte-Order-Mark)? Make sure you save your files without the BOM or include code to strip the BOM.

The BOM is usually invisible in text editors.

I know Notepad++ has a feature where you can easily strip a BOM from a file. Encoding > Encode in UTF-8 without BOM.
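
If you would rather strip it in code, a minimal sketch for a UTF-8 file, where the BOM decodes to a single leading U+FEFF character:

var fs = require("fs");

var data = fs.readFileSync("test.sql", "utf8");
// Remove a leading byte order mark, if any, before splitting into lines.
data = data.replace(/^\uFEFF/, "");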

Halcyon
  • Turns out the first character is the BOM. However, removing it does not seem to fix the "\u0000" issue. – d512 Jan 18 '13 at 17:04
  • I used Notepad++ to convert my file to UTF-8, then read the file using fs.readFileSync to get rid of "SyntaxError: Unexpected token � in JSON at position 0" – Zee Apr 17 '21 at 12:51

I did the following in the Windows command prompt to convert the encoding (`type` decodes the UTF-16 input, and the redirection writes it back out in the console's current code page):

type file.txt > file2.txt
Chong Lip Phang
  • Not only was my problem parsing an Apache log file fixed by this solution, but the file size also decreased to almost half of its original size. – Farid Rn Apr 14 '20 at 10:00

Use iconv-lite, the pure JavaScript alternative to iconv:

var fs = require("fs"),
    iconv = require('iconv-lite');

var result = "";
var stream = fs.createReadStream(sourcefile)
    .on("error", function(err){
        // handle read error
    })
    .pipe(iconv.decodeStream('win1251')) // decode from the source encoding
    .on("error", function(err){
        // handle decode error
    })
    .on("data", function(data){
        result += data; // collect the decoded chunks
    })
    .on("end", function(){
        // use result
    });
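
iconv-lite can also decode a whole buffer in one go. For the UTF-16 file in the question, a minimal sketch assuming iconv-lite's 'utf16' alias, which reads the BOM to pick the byte order:

var fs = require("fs"),
    iconv = require('iconv-lite');

// Read the raw bytes and decode; 'utf16' uses the BOM to pick endianness.
var buffer = fs.readFileSync("test.sql");
var text = iconv.decode(buffer, 'utf16');
var lines = text.split("\r\n"); // files from Windows tools likely use CRLF
console.log(lines);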
Vikas