21

I want to simply read user-supplied files as text.

I can rely on modern browsers, so I use FileReader for that (which works like a charm).

reader.readAsText(myfile, encoding);

I know that encoding defaults to UTF-8.

But since my users will upload files from various sources (Windows, Mac, Linux) and various browsers, I ask them to provide the encoding via a select box.

So for a Western European Windows text file, for example, I expect the user to choose windows-1252.

I was not able to find a list of supported encodings for FileReader (I assume it depends at least on the browser).

I am not asking to auto-determine the encoding, I just want to fill my select box in a way like:

<select id="encoding">
   <option value="windows-1252">Windows (Western Latin)</option>
   <option value="utf-8">UTF-8</option>
   <option value="...">...</option>
</select>

So my questions are:

  1. Where do I get a list of supported encodings to fill the option values?
  2. How do I determine the exact spelling of those values (is it 'utf8' or 'UTF-8' or...), and does it depend on the OS / browser?
  3. Does readAsText(myfile, unsupportedEncoding) throw any error which I can catch if encoding is not supported?

I'd prefer not to use any major 3rd party libraries for that.

Bonus Question:

Is there a simple way to get meaningful translations of the values, e.g. cp10029 means Mac (Central European)?

LBA
  • A cursory search on Google didn't reveal much. Maybe this will help? http://stackoverflow.com/questions/37884928/cant-fit-file-encoding-when-working-with-chrome-file-system-api/37885580 – Dan Wilson Nov 24 '16 at 15:59
  • Thanks, I googled a lot, that's why I am asking here :-( I checked your recommendation, but IMHO it deals with input that is not really text, whereas in my case all files are "real text", just in different encodings. – LBA Nov 24 '16 at 16:16
  • The supported code pages can be found [here](https://developer.mozilla.org/en-US/docs/Web/API/TextDecoder/TextDecoder#Parameters). I would recommend taking a second look at the link provided by Dan, as this is a good way to go about it. That approach also lets you detect a BOM and offers features for guessing the encoding in advance. –  Nov 26 '16 at 05:43

2 Answers

10
  1. The Encoding Standard: https://github.com/whatwg/encoding/ (in JSON format: https://github.com/whatwg/encoding/blob/master/encodings.json; use the values from the "labels" fields).


  2. The encoding parameter is not case sensitive.

  3. No, readAsText(myfile, unsupportedEncoding) does not throw an error. The function simply falls back to the default encoding ("utf8").

    window.onload = function() {
    
        //Check File API support
        if (window.File && window.FileList && window.FileReader) {
            var filesInput = document.getElementById("files");
    
            filesInput.addEventListener("change", function(event) {
    
                var files = event.target.files; //FileList object
                var output = document.getElementById("result");
    
                for (var i = 0; i < files.length; i++) {
                    var file = files[i];
    
                    //Only plain text
                    if (!file.type.match('plain')) continue;
    
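                    var reader = new FileReader();
    
                    reader.addEventListener("load", function(event) {
    
                        //event.target is the FileReader; result holds the decoded text
                        var textFile = event.target;
    
                        var div = document.createElement("div");
    
                        div.innerText = textFile.result;
    
                        //Append the new div at the end of the output element
                        output.insertBefore(div, null);
    
                    });
                    //Read the text file; the mixed-case label "cP1251" still
                    //resolves to windows-1251 because labels are not case sensitive
                    reader.readAsText(file, "cP1251");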
                }
    
            });
        }
        else {
            console.log("Your browser does not support File API");
        }
    }
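
Because readAsText silently falls back to UTF-8, one possible way to surface an unsupported encoding is to read the file as bytes and decode them with TextDecoder (mentioned in the comments on the question), which throws a RangeError for a label it does not recognize. This is only a sketch under assumptions: `decodeBytes` and `readFileAsText` are hypothetical helper names, not part of the answer above, and passing `fatal: true` is a choice made here so that malformed input raises a TypeError instead of producing replacement characters.

```javascript
// Hypothetical helper (not from the original answer): decode raw bytes
// with an explicit encoding label and let failures surface as exceptions.
function decodeBytes(bytes, label) {
    // Throws RangeError if the label is not a supported encoding label.
    var decoder = new TextDecoder(label, { fatal: true });
    // With fatal: true, throws TypeError if the bytes are malformed
    // for the chosen encoding.
    return decoder.decode(bytes);
}

// Hypothetical browser-side wrapper: read the file as bytes, then decode.
function readFileAsText(file, label, onText, onError) {
    var reader = new FileReader();
    reader.addEventListener("load", function () {
        try {
            onText(decodeBytes(new Uint8Array(reader.result), label));
        } catch (err) {
            // RangeError: unsupported label; TypeError: malformed data
            onError(err);
        }
    });
    reader.addEventListener("error", function () {
        onError(reader.error);
    });
    reader.readAsArrayBuffer(file);
}
```

With this approach the select box value can be checked when the file is read: `readFileAsText(file, select.value, showText, showError)`, where `showText` and `showError` stand for whatever callbacks the page provides.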
    


To get human-readable names for the values, you can use the same JSON file (https://github.com/whatwg/encoding/blob/master/encodings.json): see the "heading" and "name" fields.
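
As a rough sketch of how that JSON file could feed the select box from the question: the code below assumes the file is an array of groups, each with a "heading" and an "encodings" array whose entries carry a "name" and a list of "labels". The helper name `buildEncodingOptions` and the inline sample are made up for illustration, and using the "name" as the option value assumes each name is also accepted as a label; check the actual file before relying on either assumption.

```javascript
// Hypothetical helper: flatten encodings.json-style data into
// { value, text, group } records, one per encoding.
function buildEncodingOptions(groups) {
    var options = [];
    groups.forEach(function (group) {
        group.encodings.forEach(function (encoding) {
            options.push({
                value: encoding.name,   // assumed to be usable as a label
                text: encoding.name,
                group: group.heading
            });
        });
    });
    return options;
}

// Tiny inline sample in the assumed layout (not the full file).
var sample = [
    { heading: "The Encoding",
      encodings: [{ name: "UTF-8", labels: ["utf-8", "utf8"] }] },
    { heading: "Legacy single-byte encodings",
      encodings: [{ name: "windows-1252",
                    labels: ["cp1252", "latin1", "windows-1252"] }] }
];

var options = buildEncodingOptions(sample);

// In a browser, turn the records into <option> elements of #encoding.
if (typeof document !== "undefined") {
    var select = document.getElementById("encoding");
    options.forEach(function (o) {
        if (select) {
            select.appendChild(new Option(o.text + " (" + o.group + ")", o.value));
        }
    });
}
```

The "heading" only names the group an encoding belongs to, so friendlier descriptions such as "Windows (Western Latin)" would still have to be maintained by hand.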

Vencovsky
Artee
  • I am a bit concerned that WHATWG seems to be the only group keeping track of what is apparently the only living standard, but your answer responds correctly to all of my questions, so I'll accept it. If a better or more "official" response appears, I might change that; I hope that sounds reasonable. – LBA Jan 23 '19 at 10:38
0

Names and labels
The table below lists all encodings and their labels user agents must support. User agents must not support any other encodings or labels.

# UTF-8
"unicode-1-1-utf-8"
"unicode11utf8"
"unicode20utf8"
"utf-8"
"utf8"
"x-unicode20utf8"

# IBM866
"866"
"cp866"
"csibm866"
"ibm866"

# ISO-8859-2
"csisolatin2"
"iso-8859-2"
"iso-ir-101"
"iso8859-2"
"iso88592"
"iso_8859-2"
"iso_8859-2:1987"
"l2"
"latin2"

# ISO-8859-3
"csisolatin3"
"iso-8859-3"
"iso-ir-109"
"iso8859-3"
"iso88593"
"iso_8859-3"
"iso_8859-3:1988"
"l3"
"latin3"

# ISO-8859-4
"csisolatin4"
"iso-8859-4"
"iso-ir-110"
"iso8859-4"
"iso88594"
"iso_8859-4"
"iso_8859-4:1988"
"l4"
"latin4"

# ISO-8859-5
"csisolatincyrillic"
"cyrillic"
"iso-8859-5"
"iso-ir-144"
"iso8859-5"
"iso88595"
"iso_8859-5"
"iso_8859-5:1988"

# ISO-8859-6
"arabic"
"asmo-708"
"csiso88596e"
"csiso88596i"
"csisolatinarabic"
"ecma-114"
"iso-8859-6"
"iso-8859-6-e"
"iso-8859-6-i"
"iso-ir-127"
"iso8859-6"
"iso88596"
"iso_8859-6"
"iso_8859-6:1987"

# ISO-8859-7
"csisolatingreek"
"ecma-118"
"elot_928"
"greek"
"greek8"
"iso-8859-7"
"iso-ir-126"
"iso8859-7"
"iso88597"
"iso_8859-7"
"iso_8859-7:1987"
"sun_eu_greek"

# ISO-8859-8
"csiso88598e"
"csisolatinhebrew"
"hebrew"
"iso-8859-8"
"iso-8859-8-e"
"iso-ir-138"
"iso8859-8"
"iso88598"
"iso_8859-8"
"iso_8859-8:1988"
"visual"

# ISO-8859-8-I
"csiso88598i"
"iso-8859-8-i"
"logical"

# ISO-8859-10
"csisolatin6"
"iso-8859-10"
"iso-ir-157"
"iso8859-10"
"iso885910"
"l6"
"latin6"

# ISO-8859-13
"iso-8859-13"
"iso8859-13"
"iso885913"

# ISO-8859-14
"iso-8859-14"
"iso8859-14"
"iso885914"

# ISO-8859-15
"csisolatin9"
"iso-8859-15"
"iso8859-15"
"iso885915"
"iso_8859-15"
"l9"

# ISO-8859-16
"iso-8859-16"

# KOI8-R
"cskoi8r"
"koi"
"koi8"
"koi8-r"
"koi8_r"

# KOI8-U
"koi8-ru"
"koi8-u"

# macintosh
"csmacintosh"
"mac"
"macintosh"
"x-mac-roman"

# windows-874
"dos-874"
"iso-8859-11"
"iso8859-11"
"iso885911"
"tis-620"
"windows-874"

# windows-1250
"cp1250"
"windows-1250"
"x-cp1250"

# windows-1251
"cp1251"
"windows-1251"
"x-cp1251"

# windows-1252
"ansi_x3.4-1968"
"ascii"
"cp1252"
"cp819"
"csisolatin1"
"ibm819"
"iso-8859-1"
"iso-ir-100"
"iso8859-1"
"iso88591"
"iso_8859-1"
"iso_8859-1:1987"
"l1"
"latin1"
"us-ascii"
"windows-1252"
"x-cp1252"

# windows-1253
"cp1253"
"windows-1253"
"x-cp1253"

# windows-1254
"cp1254"
"csisolatin5"
"iso-8859-9"
"iso-ir-148"
"iso8859-9"
"iso88599"
"iso_8859-9"
"iso_8859-9:1989"
"l5"
"latin5"
"windows-1254"
"x-cp1254"

# windows-1255
"cp1255"
"windows-1255"
"x-cp1255"

# windows-1256
"cp1256"
"windows-1256"
"x-cp1256"

# windows-1257
"cp1257"
"windows-1257"
"x-cp1257"

# windows-1258
"cp1258"
"windows-1258"
"x-cp1258"

# x-mac-cyrillic
"x-mac-cyrillic"
"x-mac-ukrainian"

This excerpt covers UTF-8 and the legacy single-byte encodings. For the remaining encodings, see https://encoding.spec.whatwg.org/#names-and-labels
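
To illustrate how the labels relate to the encoding names in the table, here is a small sketch that relies on the `encoding` property of TextDecoder, which reports the lowercased name of the encoding a label resolves to. `canonicalEncodingName` is a hypothetical helper name; it returns null when the runtime rejects the label.

```javascript
// Hypothetical helper: map any label to the name of its encoding,
// or null if the runtime does not recognize the label.
function canonicalEncodingName(label) {
    try {
        return new TextDecoder(label).encoding;
    } catch (err) {
        if (err instanceof RangeError) {
            return null; // unknown or unsupported label
        }
        throw err;
    }
}

// Labels from the same group resolve to one name, regardless of case.
["utf8", "UTF-8", "unicode-1-1-utf-8"].forEach(function (label) {
    console.log(label, "->", canonicalEncodingName(label));
});
```

In a browser that follows the Encoding Standard, "latin1" and "cp1252" should likewise both resolve to "windows-1252", which is a quick way to check a select box value before passing it to readAsText.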

Avatar