8

I'm trying to save a CSV file using JavaScript, with a prepended UTF-8 BOM. However, when checking the downloaded file, it seems that the BOM is always stripped. The following code reproduces the issue:

var csv = '\ufefftest,test2';
var blob = new Blob([csv], {type: 'text/csv;charset=utf-8'});
var url = URL.createObjectURL(blob);
var a = document.createElement('a');
a.href = url;
a.download = 'test.csv';
document.body.appendChild(a);
a.click();

Adding the BOM character to the string twice produces the correct result:

var csv = '\ufeff\ufefftest,test2';

The resulting file should have the BOM character at the beginning.

Why is it being stripped in this example?

EDIT: My use case is generating a CSV file, and ensuring that the file can be opened with correct encoding by Microsoft Excel. I'm thinking that maybe the BOM is detected and truncated, but Excel needs the character to be present to detect UTF-8.

jkxyz

3 Answers

5

My best guess is that some browsers interpret the BOM in the text and strip it.

I added an example where the BOM is prepended to the Blob as an ArrayBuffer instead of as part of the string. This seems to work.

But be aware that the bytes FE FF are the UTF-16 (BE) BOM, not the UTF-8 one, which is EF BB BF. https://de.wikipedia.org/wiki/Byte_Order_Mark

var csv = 'test,test2';

// create BOM UTF-8 (bytes EF BB BF)
var buffer = new ArrayBuffer(3);
var dataView = new DataView(buffer);
dataView.setUint8(0, 0xef);
dataView.setUint8(1, 0xbb);
dataView.setUint8(2, 0xbf);
var read = new Uint8Array(buffer);

// create BOM UTF-16 (BE) instead (bytes FE FF).
// Left commented out so it does not overwrite the UTF-8 BOM above;
// only use it if the content itself is UTF-16 encoded.
// var buffer = new ArrayBuffer(2);
// var dataView = new DataView(buffer);
// dataView.setUint8(0, 0xfe);
// dataView.setUint8(1, 0xff);
// var read = new Uint8Array(buffer);

var blob = new Blob([read /*prepend bom*/, csv], {type: 'text/csv;charset=utf-8'});
var url = URL.createObjectURL(blob);
var a = document.createElement('a');
a.href = url;
a.download = 'test.csv';
document.body.appendChild(a);
a.click();
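
The same byte-level approach can be written more compactly with a typed array literal. This is only a sketch: `withUtf8Bom` is a made-up helper name, and it assumes an environment where `TextEncoder` is available (modern browsers, recent Node.js).

```javascript
// Hypothetical helper (not from the answer above): returns the UTF-8 bytes
// of `text`, preceded by the three-byte UTF-8 BOM EF BB BF.
function withUtf8Bom(text) {
  const bom = new Uint8Array([0xef, 0xbb, 0xbf]);
  const body = new TextEncoder().encode(text); // TextEncoder always emits UTF-8
  const out = new Uint8Array(bom.length + body.length);
  out.set(bom, 0);
  out.set(body, bom.length);
  return out;
}

const bytes = withUtf8Bom('test,test2');
// Hex dump of the first three bytes, to confirm the BOM is in place.
const bomHex = Array.from(bytes.slice(0, 3), b => b.toString(16).padStart(2, '0')).join(' ');
console.log(bomHex, bytes.length); // ef bb bf 13
```

The result could then be passed to the Blob constructor as `new Blob([withUtf8Bom(csv)], {type: 'text/csv;charset=utf-8'})`, so the BOM never passes through a string.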
Bellian
  • When opening the resulting file in Emacs, it doesn't seem to detect the UTF-8 encoding, just displaying 3 Latin charset characters for those codepoints. – jkxyz May 16 '19 at 08:38
  • Like I said in the answer, the BOM FEFF is NOT the UTF-8 BOM but the UTF-16 (BE) one. – Bellian May 16 '19 at 08:57
  • For utf-8 it looks like a typo in the first byte - should be `ef` instead of `fe`. But otherwise this works for me! – Demerit Aug 31 '21 at 23:50
2
var csv = 'test,test2';

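// decodeURIComponent('%ef%bb%bf') decodes the percent-encoded UTF-8 bytes EF BB BF into the single character U+FEFF
var blob = new Blob([decodeURIComponent('%ef%bb%bf') /*prepend bom*/, csv], {type: 'text/csv;charset=utf-8'});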
var url = URL.createObjectURL(blob);
var a = document.createElement('a');
a.href = url;
a.download = 'test.csv';
document.body.appendChild(a);
a.click();
0

Your BOM is there.

It's simply that whatever you use to read the file discards it, since, well, it shouldn't be part of the text.
However, if you make a hex dump or read the file as an ArrayBuffer, you'll see it's still there:

const csv = '\ufefftest,test2';
const blob = new Blob([csv], {type: 'text/csv;charset=utf-8'});
download(blob);
read(blob);

inp.onchange = e => read(inp.files[0]);

async function read(blob) {
  // grab the byte content
  const buf = await new Response(blob).arrayBuffer();
  // stupidly map to some string characters
  const str = [...new Uint8Array(buf)]
    .map(c => String.fromCharCode(c)); // only for the demo, this doesn't convert from bytes to string in UTF-8!
  console.log(str);
}

function download(blob) {
  const a = document.createElement('a');
  a.download = 'file.csv';
  a.href = URL.createObjectURL(blob);
  a.textContent = 'download';
  document.body.prepend(a);
}
<br><label>you can reupload it here too<input type="file" id="inp"></label>

And note that the other answer is right in that your BOM is actually the one of UTF-16BE, but that's not your problem yet.
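
The "discarded on read" behaviour can also be seen without downloading anything. The following is a minimal sketch, assuming `TextEncoder` and `TextDecoder` are available (modern browsers, recent Node.js); the variable names are made up for illustration. It encodes the string from the question to UTF-8, then decodes it twice: once with the default settings, which drop a leading BOM, and once with `ignoreBOM: true`, which keeps it.

```javascript
// Encode the question's string to UTF-8 bytes.
const encoded = new TextEncoder().encode('\ufefftest,test2');

// Hex dump of the first three bytes.
const firstThree = Array.from(encoded.slice(0, 3), b => b.toString(16).padStart(2, '0')).join(' ');

// Default decoder: a leading BOM is removed from the decoded text.
const decodedDefault = new TextDecoder('utf-8').decode(encoded);

// With ignoreBOM: true the BOM is kept as a U+FEFF character in the output.
const decodedKeepBom = new TextDecoder('utf-8', { ignoreBOM: true }).decode(encoded);

console.log(firstThree);            // ef bb bf
console.log(decodedDefault.length); // 10
console.log(decodedKeepBom.length); // 11
```

So the bytes carry the BOM, and whether you see it depends on how the reader decodes them.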

Kaiido