7

I have a text file with an unusual encoding, "UCS-2 Little Endian", and I want to read its contents using Java.

[Screenshot: the text file opened in Notepad++]

As you can see in the screenshot above, the file contents appear fine in Notepad++, but when I read the file using this code, only garbage is printed to the console:

// Needs java.io.BufferedReader, java.io.FileInputStream and java.io.InputStreamReader
String filePath = "c:\\strange_file_encoding.txt";
BufferedReader reader = new BufferedReader( new InputStreamReader( new FileInputStream( filePath ), "UTF8" ) );
String line;

while ( ( line = reader.readLine() ) != null ) {
    System.out.println( line );  // Prints garbage characters
}

The main point is that the user selects the file to read, so it can be in any encoding. Since I can't detect the file's encoding, I decode it using "UTF8", but as the example above shows, that fails to read this file correctly.

Is there a way to read such files correctly? Or can I at least detect when my code will fail to read a file correctly?

jeemar
Brad

3 Answers

7

You are passing UTF-8 as the encoding to the InputStreamReader constructor, so it tries to interpret the bytes as UTF-8 instead of UCS-2 LE. Here is the documentation: Charset

According to it, you probably need to use UTF-16LE.

Here is more info on the supported character sets and their Java names: Supported Encodings

Vivin Paliath
tempoc
  • Thanks a lot. As described in my question, the main problem is that this is not the only text file used. The user selects the file to read, and it can have any encoding, so will "UTF-16LE" read any text file in any encoding? – Brad Mar 19 '13 at 22:41
  • There's no surefire way, but give this a shot: [juniversalchardet](https://code.google.com/p/juniversalchardet/) – tempoc Mar 19 '13 at 22:51
1

You're providing the wrong encoding to InputStreamReader. Have you tried using UTF-16LE instead of UTF8?

BufferedReader reader = new BufferedReader( new InputStreamReader( new FileInputStream( filePath ), "UTF-16LE" ) );

According to Charset:

UTF-16LE Sixteen-bit UCS Transformation Format, little-endian byte order
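As a fuller illustration of the one-liner above, here is a minimal, self-contained sketch. It assumes the file really is UTF-16LE (UTF-16 is a superset of UCS-2) and that it may start with a byte order mark, which editors often write for this encoding. Java's UTF-16LE decoder does not strip that mark, so the sketch drops a leading U+FEFF itself. The class name, method name and sample text are hypothetical, chosen only for this example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class Utf16LeReadSketch {

    // Reads every line of a file that is assumed to be UTF-16LE encoded.
    static List<String> readUtf16LeLines(Path file) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_16LE)) {
            String line;
            boolean first = true;
            while ((line = reader.readLine()) != null) {
                // The UTF-16LE decoder keeps a byte order mark as a character,
                // so remove a leading U+FEFF from the first line if present.
                if (first && line.startsWith("\uFEFF")) {
                    line = line.substring(1);
                }
                first = false;
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Build a sample file: BOM bytes FF FE followed by UTF-16LE text.
        byte[] bom = {(byte) 0xFF, (byte) 0xFE};
        byte[] text = "first line\nsecond line\n".getBytes(StandardCharsets.UTF_16LE);
        byte[] content = new byte[bom.length + text.length];
        System.arraycopy(bom, 0, content, 0, bom.length);
        System.arraycopy(text, 0, content, bom.length, text.length);

        Path sample = Files.createTempFile("ucs2le-sample", ".txt");
        try {
            Files.write(sample, content);
            for (String line : readUtf16LeLines(sample)) {
                System.out.println(line);
            }
        } finally {
            Files.delete(sample);
        }
    }
}
```

If the byte order is not known in advance, the plain "UTF-16" charset is an alternative: when decoding, it uses the byte order mark to choose the byte order and assumes big-endian when no mark is present.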

Vivin Paliath
1

You cannot use UTF-8 encoding for all files, especially if you do not know which file encoding to expect. Use a library that can detect the file encoding before you read the file, for example: juniversalchardet or jChardet

For more info see Java : How to determine the correct charset encoding of a stream
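For cases where adding a library is not an option, the following is a rough JDK-only sketch of a much weaker heuristic than the detectors named above. It is not how those libraries work. It checks for a byte order mark and otherwise tests whether the bytes are strictly valid UTF-8, returning null when neither applies so the caller knows that decoding as UTF-8 would produce garbage. The class and method names are hypothetical, and UTF-32 marks and files without a mark in other encodings are deliberately not handled.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class CharsetGuessSketch {

    // Returns the charset indicated by a leading byte order mark, or null if none is found.
    static Charset charsetFromBom(byte[] data) {
        if (data.length >= 3 && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB && (data[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (data.length >= 2 && (data[0] & 0xFF) == 0xFF && (data[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE;
        }
        if (data.length >= 2 && (data[0] & 0xFF) == 0xFE && (data[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        return null;
    }

    // True if the bytes decode as UTF-8 without any malformed sequence.
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Heuristic guess: BOM first, then strict UTF-8, otherwise null for "unknown".
    static Charset guessCharset(byte[] data) {
        Charset fromBom = charsetFromBom(data);
        if (fromBom != null) {
            return fromBom;
        }
        return isValidUtf8(data) ? StandardCharsets.UTF_8 : null;
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xFF, (byte) 0xFE, 'h', 0, 'i', 0};
        byte[] utf8 = "h\u00E9llo".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "h\u00E9llo".getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(guessCharset(withBom));
        System.out.println(guessCharset(utf8));
        System.out.println(guessCharset(latin1));
    }
}
```

In real use the byte array would come from the user's file, for example via Files.readAllBytes. One known blind spot: pure ASCII text stored as UTF-16 without a byte order mark is also valid UTF-8 (with embedded NUL bytes), so this check would misreport it. A statistical detector is the better choice when such files are expected.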

Community
Dror Bereznitsky