I need to export string data that includes the degree symbol ("\u00B0"). The data is exported as a CSV text file with UTF-8 encoding. As expected, the degree symbol is written as two bytes (0xC2, 0xB0) when the Java (Unicode) string is encoded. When the CSV file is imported into Excel, it is displayed as a capital A with a circumflex accent, followed by the degree symbol.

I know that "UTF-8" encodes only 7-bit ASCII as single bytes, not 8-bit "extended ASCII", and that "US-ASCII" supports only 7-bit ASCII, period.

Is there some way to specify encoding such that the 0xC2 prefix byte is suppressed?

I'm leaning toward allowing normal processing to occur, then reading & overwriting the file contents, stripping the extra byte.

I'd really prefer a more elegant solution...

gOnZo
    By default, Excel decodes the file as Microsoft's standard CP1252 unless instructed otherwise. You should use Google **before** using SO... http://stackoverflow.com/questions/6002256/is-it-possible-to-force-excel-recognize-utf-8-csv-files-automatically – Phantômaxx Jun 18 '15 at 19:18

1 Answer

Excel assumes csv files are in an 8-bit code page.

To get Excel to parse your csv as UTF-8, you need to add a UTF-8 Byte Order Mark to the start of the file.
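As a minimal sketch of that fix in Java: the UTF-8 BOM is the character U+FEFF, which encodes to the three bytes EF BB BF, so prepending it to the text before encoding is enough. The class name `BomCsvExample`, the helper `encodeWithBom`, and the output file `export.csv` are hypothetical names chosen for illustration, not part of the original question.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

class BomCsvExample {
    // Encodes the CSV text as UTF-8 with a leading BOM (bytes EF BB BF).
    // Hypothetical helper for illustration.
    static byte[] encodeWithBom(String csv) {
        // U+FEFF is the byte order mark; in UTF-8 it becomes EF BB BF.
        return ("\uFEFF" + csv).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        String csv = "city,temperature\nOslo,21.5\u00B0\n";
        // "export.csv" is an assumed output path.
        Files.write(Paths.get("export.csv"), encodeWithBom(csv));
    }
}
```

The degree symbol is still written as 0xC2 0xB0; the BOM only tells Excel to decode the file as UTF-8 instead of the local 8-bit code page.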

Edit:

If you're in Western Europe or US, Excel will likely use Windows-1252 character set for decoding and encoding when encountering files without a Unicode Byte Order Mark.

As 0xC2 and 0xB0 are both legal Windows-1252 characters, Excel will decode to the following:

0xC2 = Â
0xB0 = °
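The misreading can be reproduced outside Excel. The sketch below decodes the same two bytes once as UTF-8 and once as Windows-1252; it assumes the JVM provides the `windows-1252` charset (standard JDKs generally do), and `MojibakeDemo` and `decodeAs` are hypothetical names. Code points are printed as hex to avoid depending on the console's own encoding.

```java
import java.nio.charset.Charset;

class MojibakeDemo {
    // Decodes raw bytes using the named charset. Hypothetical helper.
    static String decodeAs(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // UTF-8 encoding of the degree symbol U+00B0.
        byte[] utf8Degree = {(byte) 0xC2, (byte) 0xB0};

        String correct = decodeAs(utf8Degree, "UTF-8");        // one character
        String misread = decodeAs(utf8Degree, "windows-1252"); // two characters

        for (String s : new String[] {correct, misread}) {
            StringBuilder hex = new StringBuilder();
            for (char c : s.toCharArray()) {
                hex.append(String.format("U+%04X ", (int) c));
            }
            System.out.println(hex.toString().trim());
        }
    }
}
```

Decoded as Windows-1252, each byte becomes its own character (U+00C2 followed by U+00B0), which is the "Â°" seen in Excel.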

Alastair McCormack
  • Thanks, I wasn't aware of the Byte Order Mark. Adding it to the start of the file fixed Excel's odd interpretation issues. It's weird: Excel interpreted the 0xC2 prefix as an extended character, but correctly interpreted 0xB0 as the degree symbol (i.e., without the prefix UTF-8 would require). Also, in 8-bit "extended ASCII", 0xF8 is the degree symbol, so Excel is not interpreting it as extended ASCII. It's almost as though Microsoft were playing fast-n-loose with the standards... – gOnZo Jun 19 '15 at 12:31
  • A second, related issue: with the UTF-8 BOM prefix, I can now import into Excel without interpretation problems. However, if I then re-save the file as "CSV", Excel does not restore the UTF-8 BOM prefix. It does helpfully offer to store the data as a UNICODE ".TXT" file (which means two bytes for each character, each character prefixed with a 0 byte). – gOnZo Jun 19 '15 at 12:36
  • Hi, I've updated my answer to explain what you're seeing. AFAIK, there's no such thing as "extended ASCII". All the popular 8-bit ones encode the degree symbol as 0xB0. See http://www.fileformat.info/info/unicode/char/b0/charset_support.htm. – Alastair McCormack Jun 20 '15 at 19:28
  • Windows uses UTF-16 for encoding Unicode, which is why "UNICODE" mode appears to prefix ASCII and basic Latin chars with `0x00`. For Unicode chars <= U+FFFF, UTF-16 maps the Unicode code point exactly. Excel will write a UTF-16 BOM to the front of the file, which also defines the endianness of the file. – Alastair McCormack Jun 20 '15 at 19:46