5

So, I'm trying to write a CSV file importer using AngularJS on the frontend and Node.js on the backend. My problem is that I'm not sure about the encoding of the incoming CSV files. Is there a way to detect it automatically?

I first tried to use FileReader.readAsDataURL() and do the detection in Node. But the file contents will be Base64-encoded, so I cannot do that (when I decode the file, I already need to know the encoding). If I do a FileReader.readAsText(), I also need to know the encoding beforehand. I also cannot do it *before* initializing the FileReader, because the actual file object doesn't seem to include the file's contents.

My current code:

generateFile = function(file){
    const reader = new FileReader();
    reader.onload = function (evt) {
        if (checkSize(file.size) && isTypeValid(file.type)) {
            scope.$apply(function () {
                scope.file = evt.target.result;
                file.encoding = Encoding.detect(scope.file);
                if (angular.isString(scope.fileName)) {
                    scope.fileName = file.name;
                }
            });
            if (form) {
                form.$setDirty();
            }
            scope.fileArray.push({
                name: file.name,
                type: file.type,
                size: file.size,
                date: file.lastModified,
                encoding: file.encoding,
                file: scope.file
            });
            --scope.pending;
            if (scope.pending === 0){
                scope.$emit('file-dropzone-drop-event', scope.fileArray);
                scope.fileArray = [];
            }
        }
    };
    let fileExtExpression = /\.csv$/i;
    if(fileExtExpression.test(file.name)){
        reader.readAsText(file);
    }
    else{
        reader.readAsDataURL(file);
    }
    ++scope.pending;
}

Is this just impossible to do or what am I doing wrong? I even tried to solve this using FileReader.readAsArrayBuffer() and extract the file header from there, but this was way too complex for me and/or didn't seem to work.
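For reference, the header check via `readAsArrayBuffer()` mentioned above can be sketched as a minimal BOM sniff. `sniffEncoding` below is a hypothetical helper, not a full detector; files without a BOM still need statistical probing of the kind a library does:

```javascript
// Minimal BOM sniffing on the first bytes of a file.
// A statistical detector goes well beyond this for BOM-less files.
function sniffEncoding(bytes) {
  // bytes: Uint8Array containing at least the first few bytes of the file
  if (bytes.length >= 3 && bytes[0] === 0xEF && bytes[1] === 0xBB && bytes[2] === 0xBF) {
    return 'utf-8';    // UTF-8 BOM
  }
  if (bytes.length >= 2 && bytes[0] === 0xFF && bytes[1] === 0xFE) {
    return 'utf-16le'; // UTF-16 little-endian BOM
  }
  if (bytes.length >= 2 && bytes[0] === 0xFE && bytes[1] === 0xFF) {
    return 'utf-16be'; // UTF-16 big-endian BOM
  }
  return null;         // no BOM: UTF-8 without BOM or a single-byte encoding
}
```

In the browser this would be fed from `reader.readAsArrayBuffer(file)` via `new Uint8Array(evt.target.result)`.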

  • Sadly, file encodings are not easily detectable, so where is your `Encoding.detect` coming from? Also, as far as I know, most text editors just "probe" the file for some typical encoded characters, guess the encoding from that, and read it again with that encoding. – xander Feb 20 '18 at 12:32
  • The `Encoding.detect` is from [this external library](https://github.com/polygonplanet/encoding.js). Unfortunately it doesn't work, since I cannot put the file contents into it *before* I have used FileReader.readAsXYZ()... – DCH Feb 20 '18 at 14:02
  • `base64` is a byte encoding, not a character encoding. It turns an array of bytes into a string. So when you decode it, you get back an array of bytes; you don't need to know any character encoding for that just yet. Now these bytes *can* represent a string, in which case they need further decoding, and for this step you will need to know the character encoding. Given a byte array you can make a few educated guesses, this works reasonably well with UTF encodings. But the problem is single-byte encodings, which are impossible to distinguish with certainty. – Tomalak Feb 20 '18 at 14:05
  • Now the question is, *why* is the contents of the files base64-encoded? That doesn't make a lot of sense. Files are meant to be byte storage. You can write bytes to the files verbatim. Encoding the byte stream as base64 does nothing in this case, except making the file larger and slowing down both writing and reading of the file. – Tomalak Feb 20 '18 at 14:09
  • Yes, I know that. base64 is now only used to upload images, so for my CSV files I would like to just read them as text. But I don't know how to dynamically set the encoding parameter for FileReader.readAsText(). – DCH Feb 20 '18 at 14:14
  • You don't really need base64 to upload images, either. base64 is only needed to transfer (or store) arbitrary bytes in an environment that does not tolerate raw bytes - typically that's limited to string-based formats such as JSON or XML. HTTP has no problem transferring raw bytes, that's what it has been made for. So... why are you using base64 at all? – Tomalak Feb 20 '18 at 14:38
  • Because that's the way FileReader.readAsDataURL() works... this is getting slightly off topic ;-). – DCH Feb 20 '18 at 14:43
  • Well, not really. Read my first comment again, I've written down what to do with the base64 string. I'm still not sure why the `readAsDataURL()` detour is necessary, `readAsArrayBuffer()` seems to be the better choice. – Tomalak Feb 20 '18 at 14:49
  • I tried readAsArrayBuffer... but when I passed this on via my http-request inside the body, it arrived as an empty object, not usable for a Buffer. Regarding your first comment: When I decode the base64-string using Buffer.toString, I still need to know the original encoding, which is what I don't know at that point. – DCH Feb 20 '18 at 15:41
  • Ah, I see. Read: https://stackoverflow.com/questions/19959072/sending-binary-data-in-javascript-over-http regarding this problem. – Tomalak Feb 20 '18 at 15:43
  • Also, as I said, base64 is a *byte* encoding. You can't decode it to string. You can decode it to *bytes*, and that's what you should be doing. Unless of course you manage to skip the entire base64 stage altogether, then clearly *that* is what you should be doing. – Tomalak Feb 20 '18 at 15:45
  • I'm highly confused now... I've been using this to decode my base64-encoded string to string just fine: `new Buffer(base64, 'base64').toString('latin1')`. I just changed this, since I don't know the actual encoding for the toString method. Well, I will read the link you just posted and try to make something of it. Thanks! – DCH Feb 20 '18 at 15:54
  • But you are calling two functions here: base64 → bytes (`Buffer(base64, 'base64')`) and bytes → string (`thatBuffer.toString('latin1')`). The information about string encoding is only necessary for the second step, and the conversion *from* base64 only happens in the first. Incidentally you need to feed bytes to encoding detectors, because that is their purpose in life: Making a guess what string encoding the bytes in question could represent. So decoding base64 does not involve anything with strings, even if you've only used it in combination so far. – Tomalak Feb 20 '18 at 16:01
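The two-step decoding described in the comments above can be sketched in Node like this; the sample bytes and the 'latin1' guess are only illustrative:

```javascript
// Step 1: base64 -> bytes. No character encoding is involved yet.
const bytes = Buffer.from('aMOpbGxv', 'base64'); // base64 of the UTF-8 bytes for "héllo"

// Step 2: bytes -> string. Only here does the character encoding matter,
// and this is the step an encoding detector has to guess for you.
const asUtf8 = bytes.toString('utf8');     // "héllo"  - correct guess
const asLatin1 = bytes.toString('latin1'); // "hÃ©llo" - wrong guess, mojibake
```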

2 Answers

3

I suggest you open your CSV using `readAsBinaryString()` from FileReader. This is the trick. Then you can detect the encoding using the jschardet library.

More info here: CSV encoding detection in javascript
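For context, `readAsBinaryString()` yields a "binary string" in which each character's code is one byte of the file, which is the representation jschardet consumes. A minimal sketch of that representation (the binary string is simulated here instead of coming from a FileReader, and jschardet itself is not invoked):

```javascript
// A "binary string": one character per byte, char codes 0-255.
// FileReader.readAsBinaryString() produces exactly this representation.
const binaryString = '\u00EF\u00BB\u00BFhello'; // UTF-8 BOM followed by ASCII text

// Recover the raw bytes, e.g. to inspect a BOM or feed a detector:
const bytes = Uint8Array.from(binaryString, ch => ch.charCodeAt(0));

const hasUtf8Bom =
  bytes[0] === 0xEF && bytes[1] === 0xBB && bytes[2] === 0xBF;
```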

guillim
  • In node: `const jschardet = require('jschardet'); jschardet.detect(await require('fs/promises').readFile(fileName))`, e.g., `jschardet.detect(await require('fs/promises').readFile('test.txt'))`, output: `{ encoding: 'UTF-8', confidence: 0.99 }` – mikey Nov 17 '22 at 13:50
3

You could try this:

$ npm install detect-file-encoding-and-language

And then detect the encoding like so:

// index.js

const languageEncoding = require("detect-file-encoding-and-language");

const pathToFile = "/home/username/documents/my-text-file.txt"

languageEncoding(pathToFile).then(fileInfo => console.log(fileInfo));
// Possible result: { language: japanese, encoding: Shift-JIS, confidence: { language: 0.97, encoding: 1 } }
Falaen