
Possible Duplicate:
How to add a UTF-8 BOM in java

My Oracle database has a character set of UTF8. I have a Java stored procedure which fetches records from a table and creates a CSV file.

// Create a temporary BLOB to hold the zipped CSV
BLOB retBLOB = BLOB.createTemporary(conn, true, BLOB.DURATION_SESSION);
retBLOB.open(BLOB.MODE_READWRITE);
OutputStream bOut = retBLOB.setBinaryStream(0L);
ZipOutputStream zipOut = new ZipOutputStream(bOut);
// All text written through this PrintStream is encoded as UTF-8
PrintStream out = new PrintStream(zipOut, false, "UTF-8");

The German characters (fetched from the table) become gibberish in the CSV if I use the above code. But if I change the encoding to ISO-8859-1, the German characters appear correctly in the CSV file.

PrintStream out = new PrintStream(zipOut, false, "ISO-8859-1");

I have read in some posts that we should use UTF-8 as it is safe and will also encode other languages (Chinese etc.) properly, which ISO-8859-1 fails to do.

Please suggest which encoding I should use. (There is a strong chance that we will have Chinese/Japanese words stored in the table in the future.)

Fadd
    Something isn't adding up. You claim that the database has the text stored as UTF-8, but when you write out the text in UTF-8 it's gibberish; that it has to be written out in ISO-8859-1 to be readable. This seems like pretty obvious evidence that the database's text is not being stored as UTF-8 but rather as ISO-8859-1. – JUST MY correct OPINION Dec 08 '10 at 09:35
  • I checked the NLS_CHARACTERSET of the database and it has the value UTF8. One interesting thing, I could open the csv using a notepad and I could see those characters properly. – Fadd Dec 08 '10 at 09:49
  • This is resolved. Please check this [link](http://stackoverflow.com/questions/4389005/how-to-add-a-utf-8-bom-in-java) – Fadd Dec 09 '10 at 07:01
  • This actually may help you: http://weblogs.java.net/blog/joconner/archive/2010/03/24/writing-csv-files-utf-8-excel – Boris Pavlović Dec 08 '10 at 09:11

3 Answers


You're currently only talking about one part of a process that is inherently two-sided.

Encoding something to bytes is only really relevant in the sense that some other process comes along and decodes it back into text at some later point. And of course, both processes need to use the same character set else the decode will fail.

So it sounds to me like the process that takes the BLOB out of the database and into the CSV file is assuming that the bytes are an ISO-8859-1 encoding of text. Hence if you store them as UTF-8, the decoding produces garbage (though the basic ASCII characters have the same byte representation in both, which is why they still decode correctly).

UTF-8 is a good character set to use in almost all circumstances, but it's not magic enough to overcome the immutable law that the same character set must be used for decoding as was used for encoding. So you can either change your CSV-creator to decode with UTF-8, or you'll have to continue encoding with ISO-8859-1.
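The mismatch described above is easy to reproduce in a few lines. This sketch (the sample string is just an illustration) encodes German text as UTF-8 and then decodes the bytes as ISO-8859-1, which is effectively what a mismatched CSV reader does:

```java
import java.nio.charset.StandardCharsets;

public class CharsetMismatchDemo {
    public static void main(String[] args) {
        String original = "Grün";  // German text with an umlaut

        // Encode with UTF-8: "ü" becomes the two bytes 0xC3 0xBC
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);

        // Decode with the wrong charset: ISO-8859-1 maps every single
        // byte to one character, so the two-byte sequence breaks apart
        String garbled = new String(utf8Bytes, StandardCharsets.ISO_8859_1);

        System.out.println(garbled);  // prints "GrÃ¼n"
        // The ASCII characters survive because both charsets agree on them
    }
}
```

The reverse direction (encode as ISO-8859-1, decode as UTF-8) is even worse: the decoder hits byte sequences that are not valid UTF-8 at all.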

Andrzej Doyle
  • Hi Andrzej, How come I can open the same file with notepad? – Fadd Dec 08 '10 at 10:37
  • Probably because notepad tries multiple character sets, or analyses the file, or just happens to guess the right one? I don't know. But what I can say is that you can open the file in notepad, because notepad is decoding the bytes on disc with the correct character set ("correct" being the one they were encoded in). – Andrzej Doyle Dec 08 '10 at 12:35

I suppose your BLOB data is ISO-8859-1 encoded. As it is stored as binary and not as text, its encoding does not depend on the database's encoding. You should check whether the BLOB was originally written in UTF-8 encoding and, if not, do so.

morja
  • How can I check which encoding was used for creating the BLOB? – Fadd Dec 08 '10 at 09:57
  • Well, by decoding it with an encoding that results in the correct display of the characters. There are some tools that try to detect an encoding, but if your BLOB decodes well with ISO-8859-1 then it is ISO-8859-1 encoded. – morja Dec 08 '10 at 10:05
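The probing approach morja describes in the comment above can be sketched with a strict `CharsetDecoder`. Note the asymmetry it reveals: a strict UTF-8 decode can positively *reject* bytes, while ISO-8859-1 accepts any byte sequence, so a successful ISO-8859-1 decode only means the result must still be checked visually. The class name and sample byte here are illustrative assumptions:

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class EncodingProbe {
    // Decode strictly; return null if the bytes are not valid in this charset.
    static String tryDecode(byte[] data, Charset cs) {
        CharsetDecoder decoder = cs.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return decoder.decode(ByteBuffer.wrap(data)).toString();
        } catch (CharacterCodingException e) {
            return null;  // malformed for this charset
        }
    }

    public static void main(String[] args) {
        // A lone 0xFC byte: "ü" in ISO-8859-1, but invalid as UTF-8
        byte[] blobBytes = { (byte) 0xFC };

        System.out.println("UTF-8:      " + tryDecode(blobBytes, StandardCharsets.UTF_8));
        System.out.println("ISO-8859-1: " + tryDecode(blobBytes, StandardCharsets.ISO_8859_1));
        // UTF-8 yields null (rejected); ISO-8859-1 yields "ü"
    }
}
```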

I think the problem is that Excel cannot figure out that the CSV is UTF-8 encoded (see: utf-8 csv issue).

But I'm still not able to resolve the issue even after putting a BOM on the PrintStream.

PrintStream out = new PrintStream(zipOut,false,"UTF-8"); 
out.write('\ufeff');

I also tried:

out.write(new byte[] { (byte)0xEF, (byte)0xBB, (byte)0xBF });

but to no avail.
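A likely reason the first attempt above fails: `PrintStream.write(int)` emits a single raw byte (here the low-order byte of U+FEFF, i.e. 0xFF), bypassing the stream's charset entirely, whereas `print(char)` runs the character through the UTF-8 encoder and produces the three BOM bytes EF BB BF. A minimal sketch of the difference, using an in-memory buffer instead of the question's ZIP stream:

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;

public class BomDemo {
    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        PrintStream out = new PrintStream(buf, false, "UTF-8");

        // print(char) goes through the UTF-8 encoder:
        // U+FEFF becomes the three bytes EF BB BF (the UTF-8 BOM).
        // By contrast, out.write('\ufeff') would call write(int) and
        // emit only the single raw byte 0xFF.
        out.print('\ufeff');
        out.print("Grün");
        out.flush();

        byte[] bytes = buf.toByteArray();
        System.out.printf("%02X %02X %02X%n",
                bytes[0] & 0xFF, bytes[1] & 0xFF, bytes[2] & 0xFF);
        // prints "EF BB BF"
    }
}
```

Writing the BOM as explicit bytes (the second attempt above) is also valid in principle, as long as it reaches the underlying stream before any buffered text and the file is nevertheless opened by a BOM-aware consumer.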

Fadd