How to ensure that a Java program uses UTF-8 encoding

Question

I recently discovered that relying on the JVM's default encoding causes bugs. I should explicitly specify an encoding, e.g. UTF-8, when working with Strings, InputStreams, etc. I have a huge codebase to scan to ensure this. Could somebody suggest a simpler way to check this than searching the whole codebase?

Thanks Nayn

I read this post: http://stackoverflow.com/questions/1749064/how-to-find-default-charset-encoding-in-java — Nayn, Jun 07 '10 at 16:21
Are you specifying an encoding other than UTF-8 somewhere? By default, strings in Java are UTF-8, so I don't see a problem here. — Imre L, Jun 07 '10 at 16:35
@Imre: the problems will manifest whenever you read/write those characters as characters from/to an external source which expects/uses a different encoding (by default), e.g. the disk file system, a datastore (database), a network connection (HTTP), etc. — BalusC, Jun 07 '10 at 16:42
@Imre no, strings are not UTF-8 by default in Java. Strings consist of 16-bit Unicode characters. If you read text from or write text to a file, those 16-bit Unicode characters will be encoded with a platform-dependent default character encoding. The default encoding is not always UTF-8. — Jesper, Jun 07 '10 at 17:06

score 4 · Answer 1 · answered Jun 07 '10 at 16:27

System.getProperty("file.encoding")

returns the JVM's default encoding for I/O operations.

You can set it by passing -Dfile.encoding=utf-8 on the command line when starting the JVM.
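To see what a given JVM actually ends up using, a minimal sketch along these lines may help. The class name `DefaultCharsetCheck` and the method `defaultIsUtf8` are hypothetical names chosen for illustration; `Charset.defaultCharset()` is the public API for the same information the system property exposes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original answer.
final class DefaultCharsetCheck {

    // True when the JVM's default charset is UTF-8.
    static boolean defaultIsUtf8() {
        return StandardCharsets.UTF_8.equals(Charset.defaultCharset());
    }

    public static void main(String[] args) {
        // The system property mentioned in the answer.
        System.out.println("file.encoding    = " + System.getProperty("file.encoding"));
        // The charset the JVM uses when no charset is given explicitly.
        System.out.println("default charset  = " + Charset.defaultCharset());
        System.out.println("default is UTF-8 = " + defaultIsUtf8());
    }
}
```

Running it once with and once without -Dfile.encoding=utf-8 shows whether the flag takes effect on your platform. Note that on JDK 18 and later the default charset is UTF-8 unless it is overridden.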

Bozho

Please see the thread that I mentioned in the comment. The above property is an internal implementation detail of specific JVM implementations. Its use varies between Java 1.5 and 1.6. – Nayn Jun 07 '10 at 16:28
It isn't. Read the accepted answer fully :) This is a standard setting that determines the default charset. – Bozho Jun 07 '10 at 16:32
Setting a property like this to correct code is an outrageous hack. – Tom Hawtin - tackline Jun 07 '10 at 17:07
@Tom I don't share your opinion on that. While it is preferable not to rely on this (and I never do), it is legitimate to use VM parameters. – Bozho Jun 07 '10 at 17:10
I have to admit that I couldn't solve this problem without setting the system property -Dfile.encoding=utf-8. I tried specifying the encoding explicitly wherever possible. – Nayn Jun 07 '10 at 20:29

BalusC · Accepted Answer · 2010-06-07T17:31:34.793

Not a direct answer, but to ease the job it's good to know that in any decent IDE you can search for usages of InputStreamReader, OutputStreamWriter, String#getBytes(), String(byte[]), Properties#load(), URLEncoder#encode(), URLDecoder#decode() and similar APIs that accept a charset, and then update them accordingly. You'd also want to search for FileReader and FileWriter and replace them with the first two classes mentioned. True, it's a tedious task, but it's worth it, and I'd prefer it over relying on environmental specifics.
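As a rough sketch of what such a replacement can look like, the helper below swaps FileReader/FileWriter for InputStreamReader/OutputStreamWriter with an explicit charset. The class `Utf8Files` and its method names are hypothetical and only serve the illustration.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; class and method names are illustrative only.
final class Utf8Files {

    // Instead of new FileWriter(file), which uses the platform default charset.
    static void write(File file, String text) throws IOException {
        try (BufferedWriter out = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(file), StandardCharsets.UTF_8))) {
            out.write(text);
        }
    }

    // Instead of new FileReader(file), which uses the platform default charset.
    static String read(File file) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8))) {
            int c;
            while ((c = in.read()) != -1) {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    // Writes and re-reads a temporary file as a quick self-check.
    static String roundTrip(String text) {
        try {
            File tmp = File.createTempFile("utf8-demo", ".txt");
            try {
                write(tmp, text);
                return read(tmp);
            } finally {
                tmp.delete();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Non-ASCII text survives regardless of the platform default encoding.
        System.out.println(roundTrip("h\u00e9llo \u20ac"));
    }
}
```

On Java 11 and later, FileReader and FileWriter also have constructors that take a Charset, so the wrapping is only strictly needed on older versions.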

In Eclipse for example, select the project(s) of interest, hit Ctrl+H, switch to tab Java Search, enter for example InputStreamReader, tick the Search For option Constructor, choose Sources as the only Search In option, and execute the search.
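If an IDE search is not practical, for example on a build server, a crude text-based scan can give a first list of candidates. The following is only a heuristic sketch: the class `DefaultCharsetScanner`, its methods, and the chosen patterns are assumptions for illustration. It flags `new FileReader(`, `new FileWriter(` and no-argument `getBytes()` calls. It will also match inside comments and string literals, and it misses cases such as single-argument `InputStreamReader` constructors, which are better found with the IDE search described above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Stream;

// Hypothetical helper; a text-based heuristic, not a real code analysis.
final class DefaultCharsetScanner {

    // Calls that silently fall back to the platform default charset.
    private static final Pattern SUSPECT = Pattern.compile(
            "\\bnew\\s+FileReader\\s*\\("
            + "|\\bnew\\s+FileWriter\\s*\\("
            + "|\\.getBytes\\s*\\(\\s*\\)");

    static boolean isSuspect(String line) {
        return SUSPECT.matcher(line).find();
    }

    // Returns "path:lineNumber: source line" for every suspicious line under root.
    static List<String> scan(Path root) throws IOException {
        List<String> hits = new ArrayList<>();
        try (Stream<Path> paths = Files.walk(root)) {
            paths.filter(Files::isRegularFile)
                 .filter(p -> p.toString().endsWith(".java"))
                 .forEach(p -> collect(p, hits));
        }
        return hits;
    }

    private static void collect(Path file, List<String> hits) {
        try {
            // ISO-8859-1 decodes any byte sequence, and the patterns are ASCII only,
            // so the scan does not depend on the encoding of the source files.
            List<String> lines = Files.readAllLines(file, StandardCharsets.ISO_8859_1);
            for (int i = 0; i < lines.size(); i++) {
                if (isSuspect(lines.get(i))) {
                    hits.add(file + ":" + (i + 1) + ": " + lines.get(i).trim());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Source root defaults to the current directory.
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        for (String hit : scan(root)) {
            System.out.println(hit);
        }
    }
}
```

Pass the source root as the first argument; each hit still needs a manual look before changing the code.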

`FileReader` is the baddy. I don't know of a comprehensive list of these dangerous API methods/constructors. — Tom Hawtin - tackline, Jun 07 '10 at 17:08

score 0 · Answer 3 · answered Jun 07 '10 at 16:32

"relying on default encoding of JVM causes bugs"

Indeed, one should always specify the charset when encoding/decoding.

If you are satisfied with a single global default charset for all your encoding/decoding (not always enough), you can live with Bozho's answer: specify a known, fixed default in your JVM arguments or in some static initializer.

But it's good practice to search for all implicit charset usages in your code and replace them with an explicit charset. Some typical methods/classes to look at: FileWriter, FileReader, InputStreamReader, OutputStreamWriter, String#getBytes(), String(byte[]).
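For the two String methods in that list, the explicit variants look roughly like this. The class `ExplicitStringEncoding` is a hypothetical name used only for this sketch.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper showing the explicit-charset overloads.
final class ExplicitStringEncoding {

    // Instead of text.getBytes(), which uses the platform default charset.
    static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Instead of new String(bytes), which uses the platform default charset.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] bytes = encode("\u00e9");           // U+00E9, e with acute accent
        System.out.println(bytes.length);          // 2 bytes in UTF-8: 0xC3 0xA9
        System.out.println(decode(bytes).equals("\u00e9"));
    }
}
```

Because both directions name the charset, the result is the same on every platform, whatever file.encoding happens to be.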

It should be noted that `FileWriter` and `FileReader` can't be changed to take a specified encoding. They should be replaced with `OutputStreamWriter` and `InputStreamReader` respectively. — BalusC, Jun 07 '10 at 16:34

score 0 · Answer 4 · answered Jun 07 '10 at 16:43

If the file is manipulated by native tools on the server, you may want to set the encoding to System.getProperty("file.encoding"). I have run into bugs both ways.

Best practice is to know which character set is used and to set it explicitly. Also, if the file is used to interface with another application, you should define the character set used. This may be a Windows code page or a different UTF format.
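When the other side uses a known charset that is not UTF-8, the conversion can be made explicit at the boundary. The sketch below assumes a hypothetical `Transcoder` class; the windows-1252 name is looked up with `Charset.forName` and guarded with `Charset.isSupported`, since only a small set of charsets is guaranteed to exist on every JVM.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; converts bytes from a known source charset to a target charset.
final class Transcoder {

    // Decode with the charset the producer used, re-encode for the consumer.
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        return new String(input, from).getBytes(to);
    }

    public static void main(String[] args) {
        // Assumption for this example: the other application writes windows-1252.
        Charset source = Charset.isSupported("windows-1252")
                ? Charset.forName("windows-1252")
                : StandardCharsets.ISO_8859_1;

        byte[] legacy = {(byte) 0xE9}; // e with acute accent in both charsets
        byte[] utf8 = transcode(legacy, source, StandardCharsets.UTF_8);
        System.out.println(utf8.length); // 2 bytes in UTF-8
    }
}
```

The point of the design is that the charset is an explicit parameter agreed with the other application, rather than whatever the JVM default happens to be.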
