0

I have tab-separated values which I need to export as a text file using Java, to be opened in Microsoft Excel. The problem arises when the tab-separated values have Chinese characters.

I tried exporting the text file using UTF-8, but Excel is not able to interpret the characters. Then I opened the exported text file in Notepad and saved it as "Unicode", and it started showing the correct characters in Excel.


So can someone tell me what is the Notepad "Unicode" equivalent in Java?

My code is:

response.getOutputStream().write(reportHTML.getBytes("UTF-8"));

Where reportHTML has tab-separated values.

This is the text file with encoding as Unicode.

dda
  • 6,030
  • 2
  • 25
  • 34
Ankur
  • 12,676
  • 7
  • 37
  • 67
  • Look at the file in a hex editor and determine whether it's UTF-8, UTF-16 or UTF-32 – jlordo Nov 28 '12 at 10:04
  • @jlordo Can you please suggest a hex editor and how to check encoding. I have also linked the correct text file which works correctly in excel – Ankur Nov 28 '12 at 10:08
  • Notepad++ tells me your File is in UCS-2 Little Endian. [Here is a List](http://docs.oracle.com/javase/1.4.2/docs/guide/intl/encoding.doc.html) of all supported encodings. – jlordo Nov 28 '12 at 10:14
  • @jlordo I don't see UCS-2 Little Endian in the list you provided, so this means we cannot do this using java? – Ankur Nov 28 '12 at 10:19
  • Excel should be able to handle UTF-8 (might need a BOM, but I don't think so); maybe you have an error in your implementation. You could use UTF-16; read [here](http://en.wikipedia.org/wiki/UTF-16) to see how it differs from UCS-2. – jlordo Nov 28 '12 at 10:27
  • Excel is not able to handle UTF-8; you can check that by downloading the linked text file, opening it in Notepad and then saving it as "UTF-8" – Ankur Nov 28 '12 at 10:34
  • Don't have Excel atm. Your file is not UTF-8, even though your code should produce a UTF-8 file. Save your file as UTF-8 (when writing it, not in Notepad), not UCS-2, and when importing in Excel there is an option to specify the charset of the imported file. UTF-8 is in that list. – jlordo Nov 28 '12 at 10:45
  • @jlordo this is the UTF-8 file after adding a BOM as suggested by Dmitry Kurilo https://dl.dropbox.com/u/99923120/stackoverflow/AccumGradebookRpt%20%281%29.txt. It shows me Chinese characters but it messes with the tabs in Excel – Ankur Nov 28 '12 at 10:49
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/20239/discussion-between-jlordo-and-ankur) – jlordo Nov 28 '12 at 10:58
  • Just one word to add: http://utf8everywhere.org. – Pavel Radzivilovsky Dec 16 '12 at 08:27
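For readers without a hex editor, the check suggested in the comments above can be approximated in Java by looking at the first bytes of the file for a byte order mark. This is only a minimal sketch: the class name `BomSniffer` and its methods are hypothetical, it recognizes only the standard Unicode BOMs, and a file without a BOM is reported as unknown rather than guessed at.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper: guesses a file's encoding from its byte order mark only.
class BomSniffer {

    static String detect(byte[] head) {
        // UTF-32LE is tested before UTF-16LE because its BOM (FF FE 00 00)
        // begins with the UTF-16LE BOM (FF FE). A UTF-16LE file whose first
        // character is U+0000 would be misreported here; that case is rare.
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        return "unknown (no BOM)";
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            // Mask to compare as unsigned values, since Java bytes are signed.
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java BomSniffer <file>");
            return;
        }
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(detect(data));
    }
}
```

A file saved from Notepad as "Unicode" should be reported as UTF-16LE by this check, since Notepad writes the FF FE mark.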

3 Answers

1

That means "UTF-16LE", and every Java platform implementation is required to support it.

response.getOutputStream().write(reportHTML.getBytes("UTF-16LE"));

Notepad's "Unicode" encoding also inserts the UTF-16LE BOM (FF FE) at the start of the file.
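Since `getBytes("UTF-16LE")` does not emit a BOM by itself, the two bytes have to be written explicitly to match what Notepad produces. The following is a minimal sketch of that idea, not a tested servlet implementation; the class name `Utf16LeExport` and the method `encodeWithBom` are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: encodes text as UTF-16LE preceded by the FF FE byte order mark.
class Utf16LeExport {

    // UTF-16LE byte order mark, as written by Notepad's "Unicode" option.
    private static final byte[] BOM = {(byte) 0xFF, (byte) 0xFE};

    static byte[] encodeWithBom(String text) {
        byte[] body = text.getBytes(StandardCharsets.UTF_16LE);
        byte[] result = new byte[BOM.length + body.length];
        System.arraycopy(BOM, 0, result, 0, BOM.length);
        System.arraycopy(body, 0, result, BOM.length, body.length);
        return result;
    }

    public static void main(String[] args) {
        // Sample tab-separated row containing Chinese characters.
        byte[] bytes = encodeWithBom("Name\tScore\n\u4e2d\u6587\t42\n");
        System.out.println(bytes.length + " bytes");
    }
}
```

In the servlet from the question this would be used as `response.getOutputStream().write(Utf16LeExport.encodeWithBom(reportHTML));`, assuming `reportHTML` already holds correctly decoded text.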

Esailija
  • 138,174
  • 23
  • 272
  • 326
  • this is the file generated in UTF-16LE https://dl.dropbox.com/u/99923120/stackoverflow/AccumGradebookRpt%20%282%29.txt and this is the snap of how it opens in excel https://dl.dropbox.com/u/99923120/stackoverflow/excel%20snap%20utf-16.png – Ankur Nov 28 '12 at 11:14
  • @Ankur the text file you gave me is either encoded in UTF-32LE or double encoded UTF-16. At closer inspection it looks like it's double encoded UTF-16 because of sequences like `4E005300`. Try a simple test like writing `"ääöö".getBytes("UTF-16LE")` to a file and see if it works – Esailija Nov 28 '12 at 11:16
  • I tried the code you gave me and this is the content type `"text/plain; charset=utf-16le"` – Ankur Nov 28 '12 at 11:19
  • @Ankur the content type doesn't matter in this case, as I said, the file is incorrectly double encoded in utf16. Try the simple test I suggested and you'll see there is a mistake in your code somewhere that causes double encoding. – Esailija Nov 28 '12 at 11:22
  • I tried "ääöö".getBytes("UTF-16LE") but it doesn't work. This is the code `return "ääööhajsgagdasfdjhDNJDSHFJHDJ".getBytes("UTF-16LE");` and this is the text file https://dl.dropbox.com/u/99923120/stackoverflow/AccumGradebookRpt%20%286%29.txt – Ankur Nov 28 '12 at 11:23
  • @Ankur what does that give? Look at the bytes of the file. – Esailija Nov 28 '12 at 11:24
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/20245/discussion-between-ankur-and-esailija) – Ankur Nov 28 '12 at 11:27
0

Try adding a BOM at the very start of the file. http://en.wikipedia.org/wiki/Byte_order_mark

Dima Kurilo
  • 2,206
  • 1
  • 21
  • 27
  • I tried adding BOM for UTF-16LE using http://stackoverflow.com/a/713255/662250 but instead of adding `FF FE` it adds `OA 00 00 00` – Ankur Nov 28 '12 at 10:20
  • Can you add it manually? The UTF-8 representation of the BOM is the byte sequence 0xEF,0xBB,0xBF – Dima Kurilo Nov 28 '12 at 10:32
  • I did as you said and it worked and exported the text file as UTF-8, but then it started messing with the tab characters, Here is the link to the exported file https://dl.dropbox.com/u/99923120/stackoverflow/AccumGradebookRpt%20%281%29.txt – Ankur Nov 28 '12 at 10:40
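One way to add the UTF-8 BOM by hand, as discussed in the comments above, is to prepend the character U+FEFF to the text before encoding, because that character encodes to 0xEF,0xBB,0xBF in UTF-8. This is a rough sketch with a hypothetical class name (`Utf8BomExport`); it does not address the tab problem reported in the last comment.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: encodes text as UTF-8 with a leading byte order mark.
class Utf8BomExport {

    static byte[] encodeWithBom(String text) {
        // U+FEFF becomes the three bytes EF BB BF when encoded as UTF-8.
        return ("\uFEFF" + text).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] bytes = encodeWithBom("Name\tScore\n");
        // Print the first three bytes in hex to confirm the BOM is present.
        System.out.printf("%02X %02X %02X%n", bytes[0], bytes[1], bytes[2]);
    }
}
```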
0

In a Windows environment, an encoding called "Unicode" usually refers to UCS-2 or UTF-16.

Joachim Sauer
  • 302,674
  • 57
  • 556
  • 614
  • I tried UCS-2 and it says `java.io.UnsupportedEncodingException: UCS-2` and UTF-16 doesn't give me the desired result – Ankur Nov 28 '12 at 10:11
  • 1
    In Notepad, “Unicode” means specifically little-endian UTF-16, UTF-16LE. – Jukka K. Korpela Nov 28 '12 at 10:52
  • this is the file generated in UTF-16LE https://dl.dropbox.com/u/99923120/stackoverflow/AccumGradebookRpt%20%282%29.txt and this is the snap of how it opens in excel https://dl.dropbox.com/u/99923120/stackoverflow/excel%20snap%20utf-16.png – Ankur Nov 28 '12 at 11:15
  • 1
    So the `à` is your problem? Then it's not an encoding problem at all! Your CSV contains HTML/XML character references, which are not known/understood outside of HTML/XML! That's quite a different topic altogether! We could have avoided this long-winded problem-finding if you had described the error behaviour in more detail in the beginning. – Joachim Sauer Nov 28 '12 at 11:32
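If the exported text really does contain character references, they would need to be decoded before the bytes are written. The thread does not show the actual references, so the sketch below assumes they are numeric ones such as `&#224;` or `&#xE0;`; named references like `&agrave;` are deliberately left untouched. The class `NumericEntityDecoder` is hypothetical and uses only the standard library.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: replaces numeric character references with the characters they denote.
class NumericEntityDecoder {

    // Group 1 captures hex digits (&#x...;), group 2 captures decimal digits (&#...;).
    // The digit limits keep the parsed value within the range of an int.
    private static final Pattern REF =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));");

    static String decode(String input) {
        Matcher m = REF.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = m.group(1) != null
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2));
            // Leave the reference as-is if it is not a valid Unicode code point.
            String replacement = Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("caf&#233;\t&#x4E2D;&#x6587;"));
    }
}
```

Decoding would happen on the `String` first, and the result would then be passed to whichever `getBytes` call is used for the export.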