28

I'm wondering what would be the best way to check if a file is binary or ASCII with Node.js?

There appear to be two ways, neither specific to node.js:

  1. Checking the MIME type: How to Check if File is ASCII or Binary in PHP. However, this has its problems: for instance, preprocessor files often don't have a recognised MIME type and fall back to application/octet-stream when checked using mime.

  2. Checking the byte size using a stream buffer, as in How to identify the file content as ASCII or binary. This seems quite intensive, and does not yet provide a node.js example.

So is there another way already? Perhaps a secret node.js call or module that I don't know about? Or if I have to do this myself, what way would be suggested?

Thanks

balupton
  • Can you define what you mean by a "binary file"? The way you test depends on precisely what you mean and there is no universally agreed definition. – David Schwartz Apr 19 '12 at 09:44
  • Let's say an image, or more specifically anything that isn't text. Sorry about that! – balupton Apr 19 '12 at 09:47
  • That's really not specific enough. What do you plan to do with the information? (Would it be sufficient to check the first 8KB for non-ASCII characters?) – David Schwartz Apr 19 '12 at 10:00
  • Sure. The issue is that there seem to be several approaches, but I'm not sure how any of them could be ported to Node.js. Your suggestion there seems great, so I'd happily accept it if a code example can show how, as the documentation isn't so clear on how you perform such a check (are those bytes ASCII or not). – balupton Apr 19 '12 at 14:03
  • You can probably consider the bytes ASCII if the high bit is clear. But that will fail for things like UTF-8 or Unicode that you may (or may not) consider text. You really do need to provide a precise definition of what "text" and "binary" mean, or you need to document your use case so we can figure out the right definitions. – David Schwartz Apr 19 '12 at 19:23
  • The only way is to check if there's some byte greater than 127, otherwise you can't. – Gabriel Llamas Apr 28 '12 at 21:56
  • Good question! But what about non-ascii text files? Like UTF-8 or something? I think the intent of the question is to decide whether a file contains some sort of "text"...or not. Is there any other approach? Even a less-than-perfect strategy? Suppose you are creating some sort of "file browser" and you want to maybe display a "preview" of the contents ( if it's text ). – Nick Perkins Apr 30 '12 at 22:09
  • My solution only works on *nix because it uses `grep`: I made this gist:[gist.github.com/elundmark/c1db309c868a67b50644](https://gist.github.com/elundmark/c1db309c868a67b50644) – elundmark Nov 27 '14 at 16:55
  • A more precise question would be how to check if a file is ASCII or non-ASCII. Fundamentally ASCII files consist of a series of 1s and 0s and are no less binary than any other encoding. – user1671787 May 29 '18 at 18:30
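
The suggestion in the comments above (check the first 8 KB and treat a byte as ASCII if its high bit is clear) could be sketched roughly as follows. The function name `looksAscii` and the `sampleSize` parameter are illustrative assumptions, not part of any library, and as the comments note, this will classify UTF-8 text containing non-ASCII characters as "not ASCII".

```javascript
const fs = require('fs');

// Hypothetical helper: returns true if the first `sampleSize` bytes of the
// file all have the high bit clear (values 0-127).
function looksAscii(filename, sampleSize = 8192) {
  const fd = fs.openSync(filename, 'r');
  try {
    const buf = Buffer.alloc(sampleSize);
    // Read at most `sampleSize` bytes starting at file offset 0.
    const bytesRead = fs.readSync(fd, buf, 0, sampleSize, 0);
    for (let i = 0; i < bytesRead; i++) {
      if (buf[i] & 0x80) return false; // high bit set: not 7-bit ASCII
    }
    return true;
  } finally {
    fs.closeSync(fd);
  }
}
```

Only a sample is inspected, so a file whose first 8 KB happen to be ASCII would still pass; whether that is acceptable depends on the use case.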

4 Answers

15

Thanks to the comments on this question by David Schwartz, I created istextorbinary to solve this problem.

balupton
  • 2
    Consider updating your question if your intent was really to identify text files in general and not ASCII encoding specifically. – maerics May 01 '12 at 06:10
  • 21
    coffeescript prevents people to easily submit patches. so you don't have to maintain much. – André Fiedler Jul 27 '14 at 11:23
  • 2
    for what it is worth, istextorbinary is now javascript – balupton May 28 '18 at 23:44
  • Note that it might be easier now after node.js introduced the `buffer.isUtf8(input)` and `buffer.isAscii(input)` API functions (Added in: node.js v19.4.0, v18.14.0, and in: v19.6.0, v18.15.0, respectively): https://nodejs.org/api/buffer.html#bufferisutf8input -- also see https://stackoverflow.com/questions/75108373/how-to-check-if-a-node-js-buffer-contains-valid-utf-8 – Mörre Jul 27 '23 at 15:15
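
As a rough illustration of the comment above, the built-in `buffer.isAscii(input)` and `buffer.isUtf8(input)` functions could be combined as below. The helper name `classifyBuffer` and its three labels are assumptions made for this sketch. The fallbacks are only there so the sketch also runs on Node.js versions that predate those functions.

```javascript
const buffer = require('buffer');

// Hypothetical helper: classify a Buffer as 'ascii', 'utf8', or 'binary'.
function classifyBuffer(buf) {
  // Prefer the built-ins; fall back to manual checks on older Node.js.
  const isAscii = typeof buffer.isAscii === 'function'
    ? buffer.isAscii
    : (b) => b.every((byte) => byte < 128);
  const isUtf8 = typeof buffer.isUtf8 === 'function'
    ? buffer.isUtf8
    // Fallback: valid UTF-8 survives a decode/encode round trip unchanged.
    : (b) => Buffer.from(b.toString('utf8'), 'utf8').equals(b);

  if (isAscii(buf)) return 'ascii';
  if (isUtf8(buf)) return 'utf8';
  return 'binary';
}

console.log(classifyBuffer(Buffer.from('hello')));
console.log(classifyBuffer(Buffer.from('héllo', 'utf8')));
console.log(classifyBuffer(Buffer.from([0x89, 0x50, 0x4e, 0x47])));
```

Note that a buffer containing NUL or other control bytes still counts as 'ascii' here, so this answers "is it ASCII/UTF-8?" rather than "is it human-readable text?".
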
5

ASCII defines characters 0-127, so if a file's entire contents are byte values in that range then it can be considered an ASCII file.

function fileIsAscii(filename, callback) {
  // Read the file with no encoding to get a raw Buffer.
  require('fs').readFile(filename, function(err, buf) {
    // Report read errors to the caller instead of throwing inside the callback.
    if (err) return callback(err);
    var isAscii = true;
    for (var i=0, len=buf.length; i<len; i++) {
      if (buf[i] > 127) { isAscii=false; break; }
    }
    callback(null, isAscii); // true iff all octets are in [0, 127].
  });
}
fileIsAscii('/usr/share/dict/words', function(err, x){/* x === true */});
fileIsAscii('/bin/ls', function(err, x){/* x === false */});

If performance is critical then consider writing a custom C++ function per your linked answer.
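
Before reaching for C++, a streaming read may already help with large files, since it can stop at the first non-ASCII byte instead of loading the whole file. The following is only a sketch under that assumption; `fileIsAsciiStream` is a hypothetical name, and it returns a Promise rather than taking a callback.

```javascript
const fs = require('fs');

// Hypothetical streaming variant: resolves to false as soon as a byte
// above 127 is seen, without reading the rest of the file.
function fileIsAsciiStream(filename) {
  return new Promise((resolve, reject) => {
    const stream = fs.createReadStream(filename);
    stream.on('error', reject);
    stream.on('data', (chunk) => {
      if (chunk.some((byte) => byte > 127)) {
        resolve(false);
        stream.destroy(); // stop reading early
      }
    });
    // 'end' fires only if the whole file was read without an early exit.
    stream.on('end', () => resolve(true));
  });
}
```

Usage would look like `fileIsAsciiStream('/bin/ls').then((isAscii) => console.log(isAscii))`.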

maerics
3

I came here from Google, but as I couldn't find a satisfactory answer, I took another approach which works for me:

const string_to_test = "I am just a piece of text";
//const binary_to_test = "��˰!1�H��1�1����!H�=u�!�";
if(/\ufffd/.test(string_to_test) === true){
    console.log("I'm 'binary'");
}else{
    console.log("I'm proper text");
}

How does it work? If you open binary data as text (without using a hex editor), the bytes that cannot be decoded are rendered as a succession of this weird character �, called the "replacement character" (U+FFFD).

  • That's only the way *some* editors and browsers display binary as text. The js string itself (unless you are grabbing it from the textarea, or whatever text thing that does that) that is holding binary data read from a file will only have values from 0 - 255 and never have ufffd. Firefox does not convert to the same character. It uses a special font to show the char code value. – aamarks Apr 28 '18 at 17:54
  • That character substitution can also appear when you grab text from some place using utf-8 with certain characters and then you try to use it in your page that's using an older code page missing those characters so it's not necessarily an indication of binary. – aamarks Apr 28 '18 at 18:03
  • how do I decode binary to image? I am having trouble in doing this, below is the link for my question. https://stackoverflow.com/questions/54939990/decode-binary-of-image-to-base64 – Shoib Mohammed A Mar 01 '19 at 07:41
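
To illustrate the point raised in the comments above, the small sketch below shows where U+FFFD comes from in Node.js: it appears when a Buffer is decoded as UTF-8 and contains invalid sequences, but not when the same bytes are decoded as 'latin1', where every byte maps to a code unit in 0-255. The sample bytes are the first four bytes of the PNG signature; the variable names are arbitrary.

```javascript
// 0x89 is a UTF-8 continuation byte with no lead byte, so it is invalid
// on its own when decoded as UTF-8.
const bytes = Buffer.from([0x89, 0x50, 0x4e, 0x47]);

const asUtf8 = bytes.toString('utf8');     // invalid sequences become U+FFFD
const asLatin1 = bytes.toString('latin1'); // each byte becomes one code unit

console.log(/\ufffd/.test(asUtf8));   // true
console.log(/\ufffd/.test(asLatin1)); // false
```

So the regex test is only meaningful if the string was produced by UTF-8 decoding of the raw bytes, and even then valid text that legitimately contains U+FFFD would be flagged as binary.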
0

Or pipe the file through a Transform and use a one-time "data" listener to set the encoding:

const { Transform, pipeline } = require('stream'),
      { createReadStream, createWriteStream } = require('fs')

// Pass-through transform: forwards every chunk unchanged.
const parser = new Transform({
    readableObjectMode: false,
    writableObjectMode: false,
    transform(data, encoding, callback) {
        callback(null, data)
    }
})
// Inspect only the first chunk. Decoding invalid UTF-8 yields U+FFFD,
// so finding it suggests binary data.
parser.once('data', (chunk) => {
    const bin = /\ufffd/
    parser.encoding = bin.test(chunk.toString('utf8')) ? 'binary' : 'utf8'
})
const file = createReadStream('./media-tests/uni.png')
const file2 = createWriteStream('./media-tests/uni2.png')
pipeline(file, parser, file2, (err) => {
    if (err) console.error(err)
})