Java - OS X - Unicode mangled string

Question

I'm processing a Unicode text file using the Java platform on OS X. When I open the file using TextEdit or TextWrangler instead of seeing "Nattvardsgästerna" I see "Nattvardsg‰sterna" (which is incorrect). When I open the file using the Java io stream, I see the same incorrect String "Nattvardsg‰sterna".

When I open the file on my PC I see the correct String. I'm not sure where to start solving this problem... Is it an issue with my OS X set-up? Should I open the Java stream with a special flag?

Thanks.

P.S. I'm opening the file like so: fileReader = new BufferedReader(new FileReader(file));

P.P.S. Also, I should mention that I'd like to output the result as an SQL text file so it is important for the OS to distinguish ä correctly.

What's the encoding of the file? How are you opening the file in Java? — Jon Skeet, Jan 18 '13 at 22:38
It turns out that TextWrangler and TextEdit can't guess the character set of the file I'm trying to open. However, using TextWrangler I was able to cycle through a few encodings and realized that Western (ISO Latin 1) seems to do the trick. — hba, Jan 18 '13 at 23:49

score 3 · Accepted Answer · edited May 23 '17 at 10:24

An InputStream reads bytes (not characters), so I assume when you say:

When I open the file using java io stream

... that you really mean "when I open the file using a Java Reader".

EDIT: Your comment says that you're doing this:

new BufferedReader(new FileReader(file));

An InputStreamReader has a constructor that lets you set the character encoding. If you don't specify one, it uses the platform default, and it's unlikely that the platform default is a Unicode encoding (on my MacBook, it's set to "US-ASCII").

To set the character encoding, you must create the intermediate InputStreamReader yourself rather than letting FileReader do it for you (because FileReader uses the platform default encoding).

Assuming the file is encoded using UTF-8, use:

new BufferedReader(new InputStreamReader(new FileInputStream(file), 
                                         Charset.forName("UTF-8")));
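
As a slightly fuller sketch of the same idea, the snippet below wraps the reader construction in a small helper. The class and method names (ExplicitCharsetReader, readFirstLine) are made up for illustration, and it reads from an in-memory byte array rather than a real file so that it runs anywhere; for a file you would pass new FileInputStream(file) instead. It uses ISO-8859-1 only because the asker's comment suggests the file is ISO Latin 1; swap in StandardCharsets.UTF_8 if the file really is UTF-8.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetReader {

    // Hypothetical helper: decodes the stream with the given charset
    // (never the platform default) and returns its first line.
    static String readFirstLine(InputStream in, Charset charset) {
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, charset))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The a-umlaut is written as a Unicode escape so the example does not
        // depend on the encoding of this source file.
        String expected = "Nattvardsg\u00E4sterna";

        // Stand-in for the file contents: in ISO-8859-1 the a-umlaut is the single byte 0xE4.
        byte[] latin1Bytes = (expected + "\n").getBytes(StandardCharsets.ISO_8859_1);

        String line = readFirstLine(new ByteArrayInputStream(latin1Bytes),
                                    StandardCharsets.ISO_8859_1);
        System.out.println(line.equals(expected)); // true
    }
}
```

The comparison is printed instead of the string itself because what the terminal shows for a non-ASCII character depends on the console's own encoding, which is a separate question from how the file was decoded.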

Alternatively, you can change the platform default by supplying an argument to the JVM. You can look at this answer for the full details, but the basic idea is that you set the file.encoding Java system property. The linked answer provides a few ways to achieve this.
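
To see which default a FileReader would pick up on a given machine, a small check along these lines may help. The class name DefaultCharsetCheck and its method are illustrative only; Charset.defaultCharset() and the file.encoding property are the standard pieces.

```java
import java.nio.charset.Charset;

public class DefaultCharsetCheck {

    // Reports the charset the JVM uses when none is given explicitly,
    // together with the raw value of the file.encoding system property.
    static String describeDefaultCharset() {
        return "default charset = " + Charset.defaultCharset().name()
             + ", file.encoding = " + System.getProperty("file.encoding");
    }

    public static void main(String[] args) {
        System.out.println(describeDefaultCharset());
    }
}
```

Running it once as java DefaultCharsetCheck and once as java -Dfile.encoding=UTF-8 DefaultCharsetCheck should show whether the property takes effect on your JVM. The result varies by platform and Java version (recent JDKs may already default to UTF-8), so passing the charset explicitly in code remains the more predictable option.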

FURTHER EDIT:

P.P.S. Also, I should mention that I'd like to output the result as an SQL text file so it is important for the OS to distinguish ä correctly.

The OS hasn't got anything to do with this. The file system is just shuffling bytes around. How those bytes are interpreted is entirely up to the applications that are reading those files. This answer tells you how to make your Java program interpret the bytes correctly. For your database to be able to interpret the bytes correctly, you'll need to configure the database encoding.
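
The point that the bytes stay the same and only their interpretation changes can be illustrated with a short sketch. The class and method names here are hypothetical, and the choice of ISO-8859-1 as the source encoding and UTF-8 as the target is an assumption for the example; use whatever encodings your file and your database actually expect.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class SameBytesDifferentText {

    // Decodes the bytes with one charset and re-encodes the text with another.
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        return new String(input, from).getBytes(to);
    }

    public static void main(String[] args) {
        String expected = "g\u00E4sterna"; // a-umlaut as an escape
        byte[] latin1 = expected.getBytes(StandardCharsets.ISO_8859_1);

        // Same bytes, two interpretations.
        String right = new String(latin1, StandardCharsets.ISO_8859_1);
        String wrong = new String(latin1, StandardCharsets.UTF_8);
        System.out.println(right.equals(expected)); // true
        System.out.println(wrong.equals(expected)); // false: 0xE4 alone is not valid UTF-8

        // Re-encoding for output: the a-umlaut takes two bytes in UTF-8.
        byte[] utf8 = transcode(latin1, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println(latin1.length + " -> " + utf8.length); // 8 -> 9
    }
}
```

When writing the SQL file, the same principle applies on the output side: wrapping a FileOutputStream in an OutputStreamWriter with an explicit charset avoids relying on the platform default, just as InputStreamReader does for input.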

Sorry, I should have specified this, but I'm using the following to open the file: new BufferedReader(new FileReader(file)) — hba, Jan 18 '13 at 23:20
@hba: I've made a slight edit so that you end up with a `BufferedReader`. In short, you must create the `InputStreamReader` yourself so that you can specify the character encoding. — Greg Kopff, Jan 18 '13 at 23:27
Thanks, I'll follow your suggestion, but ultimately I'm worried that my solution won't work: if mysqldev doesn't recognize the character set, the SQL strings generated by my app will be useless. — hba, Jan 18 '13 at 23:36
@hba: The OS hasn't got anything to do with this. The file system is just shuffling bytes around. How those bytes are interpreted is entirely up to the applications that are reading those files. This answer tells you how to make your Java program interpret the bytes correctly. For your database to be able to interpret the bytes correctly, you'll need to configure the database encoding. — Greg Kopff, Jan 18 '13 at 23:42
Greg, you're correct. I had to cycle through a few encodings to find the right one, and now all my apps are working as expected. Thanks. — hba, Jan 19 '13 at 01:16
