
I am building a simple application where users can load any file into the Monaco editor in a web browser.

I'm trying to work out if the file that the user has loaded is text, and therefore editable.

In JavaScript, the library I am using to load returns the loaded file as an ArrayBuffer. Of course I can just convert this to text regardless of whether or not it is text or binary and throw the result into the editor. Presumably binary converted to text will display as garbage in the Monaco editor.

I could also examine the MIME type of the loaded file. This would go a long way towards solving the problem, but it means I somehow have to know which MIME types are text, and I have not been able to find anything that specifies this. It also means that any file without the correct MIME type set would not be editable.

So my question is, is there a way to know if the contents of a JavaScript ArrayBuffer is text or binary data such as an image or executable code, by examining the data itself, rather than referring to mime type?

EDIT: This question is not a duplicate of questions that simply ask how to convert an ArrayBuffer to text. Converting an ArrayBuffer to text doesn't tell me whether or not the file contains editable text or is a binary file. Additional information is needed, such as the magic number suggested in the answers to this question.

Duke Dougal
  • Does this answer your question? [Conversion between UTF-8 ArrayBuffer and String](https://stackoverflow.com/questions/17191945/conversion-between-utf-8-arraybuffer-and-string) – Jared Smith Sep 07 '21 at 23:30
  • Can't you just read a handful of bits and then make a guess? – F. Müller Sep 07 '21 at 23:31
  • A very rough heuristic would be to see if any characters with codes <32 except for 9, 10 and 13 exist in the buffer - if yes, it's binary. But this is only an 80% solution - it's a complicated topic. For instance, UTF-16-encoded text would be incorrectly recognized as binary this way, but this opens up the new question of how you even detect the encoding (you need another heuristic for that) to be able to parse the text properly... – CherryDT Sep 07 '21 at 23:36
  • You will find a [multitude of different approaches](https://www.npmjs.com/search?q=isbinary) on npm :) – CherryDT Sep 07 '21 at 23:38
  • @JaredSmith no, simply converting an ArrayBuffer to text doesn't tell whether or not this is a file that contains editable text or if it is a binary file. Additional information is needed, such as the magic number suggested in the answers to this question. – Duke Dougal Sep 07 '21 at 23:44
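
The rough heuristic from CherryDT's comment could be sketched as follows. This is only an illustration under stated assumptions: `looksBinary` and `sampleSize` are hypothetical names, not part of any library, and checking only the first few kilobytes is a choice made here to keep the scan cheap.

```javascript
// Heuristic sketch: treat a buffer as binary if it contains control bytes
// (codes below 32) other than tab (9), line feed (10) and carriage return (13).
// `looksBinary` and `sampleSize` are hypothetical names for illustration.
function looksBinary(arrayBuffer, sampleSize = 8192) {
  // Inspect at most the first `sampleSize` bytes.
  const length = Math.min(sampleSize, arrayBuffer.byteLength);
  const bytes = new Uint8Array(arrayBuffer, 0, length);
  for (const byte of bytes) {
    if (byte < 32 && byte !== 9 && byte !== 10 && byte !== 13) {
      return true;
    }
  }
  return false;
}

// Plain ASCII text with a tab and a newline.
const asciiSample = Uint8Array.from("hello\tworld\n", (c) => c.charCodeAt(0)).buffer;
// The 8-byte PNG signature followed by two zero bytes.
const pngSample = new Uint8Array([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a, 0x00, 0x00]).buffer;

console.log(looksBinary(asciiSample)); // false
console.log(looksBinary(pngSample)); // true
```

As the comment warns, this misclassifies UTF-16 text, because ASCII characters encoded as UTF-16 contain zero bytes, so it is a first filter rather than a reliable test.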

2 Answers


You can check the magic numbers of the ArrayBuffer. Magic numbers are constant byte sequences, usually at the start of a file, that you can check to distinguish between many file formats.

Wikipedia - Magic numbers

This NPM module uses that approach. Here's a list of the module's supported file types; you can see that it doesn't support text types.
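
As a rough illustration of the magic-number idea (not the npm module's actual implementation), the check amounts to comparing the first bytes of the buffer against known signatures. The `SIGNATURES` table and `detectByMagicNumber` are hypothetical names, and the table lists only four well-known formats.

```javascript
// A few well-known file signatures. The table is deliberately tiny;
// `SIGNATURES` and `detectByMagicNumber` are hypothetical names.
const SIGNATURES = [
  { type: "png", bytes: [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a] },
  { type: "pdf", bytes: [0x25, 0x50, 0x44, 0x46] }, // "%PDF"
  { type: "zip", bytes: [0x50, 0x4b, 0x03, 0x04] }, // "PK\x03\x04"
  { type: "gif", bytes: [0x47, 0x49, 0x46, 0x38] }, // "GIF8"
];

function detectByMagicNumber(arrayBuffer) {
  const head = new Uint8Array(arrayBuffer);
  for (const { type, bytes } of SIGNATURES) {
    // A signature matches when every one of its bytes equals the
    // byte at the same offset from the start of the buffer.
    if (head.length >= bytes.length && bytes.every((b, i) => head[i] === b)) {
      return type;
    }
  }
  // No match: the data may be text, or a binary format not in the table.
  return undefined;
}

const pdfSample = new Uint8Array([0x25, 0x50, 0x44, 0x46, 0x2d, 0x31]).buffer;
console.log(detectByMagicNumber(pdfSample)); // "pdf"
```

A missing match does not prove the data is text; it only means the format is not one of the listed binary formats, which is the limitation raised in the comments below.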

UPDATE: I've written an article about this which contains more explanations and a little sandbox.

the_previ
  • This is certainly pointing in the right direction, although the npm module referred to only determines whether binary data represents a known binary format; it does not make any determination about whether it is text. This package, which I found as a result of the comment above from @CherryDT, does however claim to make such a determination: https://www.npmjs.com/package/istextorbinary – Duke Dougal Sep 07 '21 at 23:48
  • @DukeDougal if you don't need to know exactly what type of text it is, you can use the library you've found in combination with the `file-type` module. If you want to know the type of text, you need a proper parser for every type of text-based file. I've updated my answer with two examples for SVG and CSV – the_previ Sep 08 '21 at 00:01

A late answer that might be helpful to someone

I have developed a package to solve this problem: https://www.npmjs.com/package/arraybuffer-isbinary

```javascript
import { isBinaryFile } from 'arraybuffer-isbinary'

console.log(isBinaryFile(buffer))
```
Tachibana Shin