I am having a problem generating and downloading a UTF-8 text file that includes an emoji. When I download a file that includes an emoji, the generated file is not encoded in UTF-8 and the emoji is not shown correctly.

I've used this solution to generate and download the file I need. This is the code I use:

function downloadFile(filename, text) {
    let element = document.createElement('a');
    element.setAttribute('href', 'data:text/plain;charset=utf-8,' + encodeURIComponent(text));
    element.setAttribute('download', filename);
    document.body.appendChild(element);
    element.click();
    document.body.removeChild(element);
}

So, if I use it like this:

downloadFile('withoutEmoji.txt','This is a test without emoji');

It downloads a file in UTF8.

But, when I use it like this:

downloadFile('withEmoji.txt','This is a test with emoji ');

The file I download doesn't show the emoji correctly, and the encoding of the file is no longer UTF8.

If I convert the 'withEmoji.txt' file to UTF8 (using notepad++ for example) the emoji gets shown correctly in the file.

How can I force the file or text to be UTF8? Or is there a way to convert the emoji before generating the file? I need the file to include the emoji and to be in UTF8.

You can see this behaviour in this fiddle.

EDIT

Notepad++ recognises the 'withEmoji.txt' file with ANSI encoding. Vanilla notepad recognises the file with 'UTF8' encoding. Using this service the file gets recognised as "File Type: ASCII text, with no line terminators".

c-chavez
  • Are you sure that the editor you're using to create the JavaScript is saving the JavaScript source in UTF-8? – kshetline May 05 '18 at 03:35
  • Your code works as is: https://i.imgur.com/oKcqtjE.png – Blue May 05 '18 at 03:36
  • It's possible your text editor is not opening the file as UTF-8 after you've downloaded. Check for `File > Reopen with Encoding > UTF-8` – Patrick Roberts May 05 '18 at 03:43
  • @FrankerZ when I open it with notepad++ it doesn't recognise the UTF8 encoding, but vanilla notepad does... didn't see this before. So, which one is correct? – c-chavez May 05 '18 at 03:46
  • @c-chavez Humor me, and try: `element.setAttribute('href', 'data:text/plain;charset=utf-8,\uFEFF' + encodeURIComponent(text));` – Blue May 05 '18 at 03:48

2 Answers

Files are just sequences of bytes stored in memory and/or on disk. Encodings are how those byte sequences are interpreted as character sequences, or strings. You can't "force" a text editor to interpret a sequence of bytes in a particular way; it just happens that emojis cause some editors to guess the file encoding wrongly and open the file with the wrong one by default.

Text files don't have any metadata or header format that indicates their encoding, so there's nothing further you can do about this behavior.

As suggested in comments, a BOM might be used to hint at a UTF-8 encoding, but according to The Unicode Standard, p. 36:

Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature.
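To make the "files are just bytes" point concrete, here is a minimal sketch of what actually ends up in the file. The helper name `toUtf8Bytes` is hypothetical, and the emoji U+1F600 is only an example, written as an escape so the result does not depend on how the JavaScript source itself is saved:

```javascript
// Hypothetical helper (not from the thread): encode a string as UTF-8 bytes,
// optionally prefixed with the 3-byte UTF-8 signature (BOM) EF BB BF.
function toUtf8Bytes(text, withBom) {
  const body = new TextEncoder().encode(text); // TextEncoder always produces UTF-8
  if (!withBom) {
    return body;
  }
  const bom = Uint8Array.of(0xEF, 0xBB, 0xBF);
  const out = new Uint8Array(bom.length + body.length);
  out.set(bom, 0);
  out.set(body, bom.length);
  return out;
}

// 'A' is one byte; the example emoji U+1F600 takes four bytes in UTF-8.
const plainBytes = toUtf8Bytes('A\u{1F600}', false);
const signedBytes = toUtf8Bytes('A\u{1F600}', true);
```

Both byte sequences are valid UTF-8. The only difference is the three leading signature bytes, which an editor may or may not use as a hint when it guesses the encoding.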

Patrick Roberts
  • @FrankerZ A BOM does not guarantee that the file will be interpreted as UTF-8. Only if the text editor recognizes it. – Patrick Roberts May 05 '18 at 03:56
  • A BOM is helpful where the encoding is required/agreed to be a Unicode encoding but you can't agree on which. In other cases, where there is no agreement at all, it just helps guess. – Tom Blodget May 05 '18 at 18:41

As has been mentioned, your code does seem to work. I created a Plunker here: http://plnkr.co/edit/IMpOJ6SCXCuw5VkKzkzo?p=preview

...that worked just fine for me.

function downloadFile(filename, text) {
  let element = document.createElement('a');
  element.setAttribute('href', 'data:text/plain;charset=utf-8,' + encodeURIComponent('\uFEFF' + text));
  element.setAttribute('download', filename);
  document.body.appendChild(element);
  element.click();
  document.body.removeChild(element);
}

function saveSample() {
  downloadFile('withEmoji.txt','This is a test with emoji ');
}

The only two reasons I can think of for not getting good results are that your text editor isn't saving your JavaScript code as UTF-8, and/or that the downloaded file isn't being opened as UTF-8.
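As a quick check of what the `href` ends up containing, the sketch below builds the same data URI without touching the DOM. The function name `buildTextDataUri` is made up for illustration, and the emoji U+1F600 is just an example, written as an escape so it survives regardless of how the source file is saved:

```javascript
// Hypothetical helper: builds the same data URI that downloadFile assigns to href.
function buildTextDataUri(text) {
  // encodeURIComponent percent-encodes the UTF-8 bytes of each character,
  // so the leading '\uFEFF' becomes the UTF-8 signature %EF%BB%BF.
  return 'data:text/plain;charset=utf-8,' + encodeURIComponent('\uFEFF' + text);
}

const bomEncoded = encodeURIComponent('\uFEFF');       // '%EF%BB%BF'
const sampleUri = buildTextDataUri('emoji \u{1F600}'); // space -> %20, emoji -> %F0%9F%98%80
console.log(sampleUri);
```

If the percent-encoded emoji bytes in the URI look right, the download itself is producing valid UTF-8, and any remaining problem is in how the source file was saved or how the downloaded file is opened.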

kshetline
  • apparently notepad++ is not recognising the encoding of the file as UTF8... am I the only one with this issue? Plain vanilla notepad does recognise the encoding as UTF8. – c-chavez May 05 '18 at 03:47
  • You could try adding the character `\uFEFF` to the output -- I've amended my code above with that change. Some text editors will do better recognizing a text file as UTF-8 if it starts with this special character, called a BOM (Byte Order Mark). – kshetline May 05 '18 at 03:56
  • @kshetline 0xFEFF is a UTF-16 BOM, the UTF-8 BOM is 0xEF,0xBB,0xBF – Patrick Roberts May 05 '18 at 03:58
  • @Patrick Roberts: The `encodeURIComponent` function will turn `\uFEFF` into the appropriate 3-byte encoding. – kshetline May 05 '18 at 04:00
  • @kshetline Ah, didn't realize that. – Patrick Roberts May 05 '18 at 04:02
  • @c-chavez Has the BOM helped? – Blue May 05 '18 at 05:30
  • @FrankerZ yes it did. With the character \uFEFF added to the output, my file always opens as UTF8 and shows the emojis correctly. Thank you! – c-chavez May 12 '18 at 06:12