Japanese text file retrieved from Github via AJAX is garbled

Question

I'm using the following AJAX call to fetch a text file containing Japanese characters from another directory in the same Github repo.

$.ajax({
    type: "GET",
    url: "https://raw.githubusercontent.com/mystuff/japaneseProject/master/data/jp.txt",
    contentType: 'text/plain; charset=utf-8',
    dataType: "text",
    cache: false, 
    success: function(data) {
        console.log(data);
    }
});

The output of console.log(data), however, is just garbage.

Something is going on with the encoding, probably, but I have no idea what. Initially the URL was a direct Dropbox link which worked perfectly, but since Dropbox discontinued its public folder, it no longer does.

If I try other hosting services like Google Drive, I either hit a CORS error or get the same garbled output.

Here's an example of the text file.

Hey, could you link us to the dataset by chance, the url is a dead link. — Neil, Mar 18 '17 at 00:43
@nfnneil I added a link to the dataset. It's just a text file of a Japanese frequency word list. — user341554, Mar 18 '17 at 00:51
It displayed perfectly for me, I used my own server though, http://neil.computer/stack/japanese.txt (pastebin doesn't allow cross-origin). Try using that, does it work then? — Neil, Mar 18 '17 at 00:57
@nfnneil Firefox and Chrome both block the request due to having mixed content (the github is https while yours is http). — user341554, Mar 18 '17 at 02:38

Kaiido · Accepted Answer · 2017-03-18T03:48:36.377

Your pastebin link is of no use.
The problem is most likely that your .txt file was saved in one of the many Japanese character encodings, while your page has its encoding set to UTF-8.

Two solutions, then:

The easiest: re-encode your txt file as UTF-8.
If you can't, fetch the file as a Blob and read it as text with a FileReader, passing the file's encoding as the second parameter of readAsText(blob, encoding).

(In the following example, I encoded the txt file as ISO-2022-JP.)

const url = 'https://dl.dropboxusercontent.com/s/ikr7tk47ygt2mfe/test-ISO2022-JP.txt?dl=0';

// Naive approach: resp.text() always decodes the body as UTF-8, so the result is garbled.
fetch(url)
  .then(resp => resp.text())
  .then(text => document.getElementById('raw').textContent = text);

// Fix: get the raw bytes as a Blob, then decode them with the file's actual encoding.
fetch(url)
  .then(resp => resp.blob())
  .then(blob => {
    const fr = new FileReader();
    fr.onload = () => document.getElementById('fileRead').textContent = fr.result;
    fr.readAsText(blob, 'ISO-2022-JP');
  });

table {
  margin-top: 12px;
  border-collapse: collapse;
}

td,
th {
  border: 1px solid #000;
  padding: 2px 6px;
  vertical-align: top;
}

tr {
  border: 0;
  margin: 0;
}

<table>
<tr>
<th>Raw response as text</th>
<th>From FileReader + encoding</th>
</tr>
<tr>
<td><pre id="raw"></pre></td>
<td><pre id="fileRead"></pre></td>
</tr>
</table>
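
As an alternative to the FileReader approach above, the same idea (get the raw bytes, then decode them with an explicitly named encoding) can probably be sketched with TextDecoder. This is not part of the original answer: the helper names decodeBytes and fetchTextAs are hypothetical, and which encoding labels are accepted depends on the browser or runtime.

```javascript
// Decode raw bytes with an explicitly named encoding
// instead of relying on the default UTF-8 decoding.
function decodeBytes(bytes, encoding) {
  // TextDecoder throws a RangeError if the runtime does not know the label.
  return new TextDecoder(encoding).decode(bytes);
}

// Hypothetical helper: fetch a file as raw bytes and decode it ourselves.
async function fetchTextAs(url, encoding) {
  const resp = await fetch(url);
  if (!resp.ok) {
    throw new Error('HTTP ' + resp.status);
  }
  return decodeBytes(await resp.arrayBuffer(), encoding);
}
```

Usage would look something like fetchTextAs(url, 'iso-2022-jp').then(text => ...), with the encoding label chosen to match how the file was actually saved.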

Is there a way to check the encoding of the file? I'm pretty sure I saved it as "Unicode" on Windows Notepad. Would that make a difference, and if yes, why did my original direct Dropbox link work but not the raw file on Github? — user341554, Mar 18 '17 at 02:42
Just tried it again with a re-encoded file. I guess apparently Unicode and UTF-8 aren't the same thing after all! Always wondered what the difference between those two options were... — user341554, Mar 18 '17 at 02:59
@user341554 Ah, Windows and encoding... According to [this answer](http://stackoverflow.com/questions/13894898/unicode-file-in-notepad), Notepad's *Unicode* is UTF-16 little endian. And no, there is no way to check a file's encoding. The best we can do is guess (e.g. by checking for unknown characters, or character ranges). But Japanese is one of the worst languages to detect, and there is no single bullet-proof way to do it. — Kaiido, Mar 18 '17 at 03:41
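
One cheap guess of the kind described in the comment above is to look for a byte order mark (BOM) at the start of the file. The sketch below is only a heuristic, and the function name sniffBom is hypothetical: it assumes the file was saved with a BOM (which, as far as I know, Notepad's "Unicode" option does), and it cannot identify BOM-less encodings such as Shift_JIS, EUC-JP or ISO-2022-JP. It also does not distinguish UTF-32, whose little-endian BOM begins with the same two bytes as UTF-16LE.

```javascript
// Guess an encoding from a leading byte order mark, if there is one.
// Returns an encoding label, or null when no known BOM is found.
function sniffBom(bytes) {
  if (bytes.length >= 3 &&
      bytes[0] === 0xEF && bytes[1] === 0xBB && bytes[2] === 0xBF) {
    return 'utf-8';
  }
  if (bytes.length >= 2 && bytes[0] === 0xFF && bytes[1] === 0xFE) {
    return 'utf-16le';
  }
  if (bytes.length >= 2 && bytes[0] === 0xFE && bytes[1] === 0xFF) {
    return 'utf-16be';
  }
  return null; // no BOM: the encoding still has to be guessed another way
}
```

The returned label could then be passed as the second argument of readAsText(blob, encoding), falling back to a known or assumed encoding when the result is null.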
