
I'm trying to read a file from the SD card, and I've been told it's in Unicode format. However, when I try to read the file, I get the following:

[Image: Encoded file]

This is the code I'm using to read the file:

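// Decodes the file as UTF-8 (the charset this question is about)
InputStreamReader fw = new InputStreamReader(new FileInputStream(root.getAbsolutePath()+"/Drive/sdk/cmd.62.out"), "UTF-8");
char[] buf = new char[255];
// read() returns how many chars were actually read, or -1 at end of stream
int count = fw.read(buf);
// Build the string from only the chars that were read, not the whole buffer
String readString = count > 0 ? new String(buf, 0, count) : "";
Log.d("courierread", readString);
fw.close();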

If I write that output to a file, this is what I get when I open it in a hex editor: [Image: Hex info]

Any thoughts on what I need to do to read the file correctly?

RichW

2 Answers


Does the file have a byte-order mark? In that case look at Reading UTF-8 - BOM marker
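As a rough illustration of what checking for a byte-order mark can look like, here is a minimal sketch. The class and method names (`BomSniffer`, `charsetFromBom`) are hypothetical, and it only recognizes the UTF-8 and UTF-16 marks; it ignores UTF-32, whose little-endian mark also starts with `FF FE`.

```java
public class BomSniffer {
    // Hypothetical helper: returns the charset name implied by a leading
    // byte-order mark, or null if the bytes do not start with a known one.
    static String charsetFromBom(byte[] head) {
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return "UTF-8";     // EF BB BF
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";  // FF FE
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";  // FE FF
        }
        return null;            // no recognized BOM
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xFF, (byte) 0xFE, 'H', 0};
        System.out.println(charsetFromBom(withBom)); // UTF-16LE
    }
}
```

A file without a BOM returns null here, in which case the encoding has to be known or guessed some other way, for example from a hex dump.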

EDIT (from comment): That looks like little-endian UTF-16 to me. Try the charset "UTF-16LE".
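A minimal sketch of that suggestion, assuming the file really is little-endian UTF-16. The class and method names (`Utf16LeRead`, `readUtf16Le`) are made up for illustration, and the in-memory sample bytes stand in for the actual file, which would be opened with a `FileInputStream` instead.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;

public class Utf16LeRead {
    // Hypothetical helper: decodes the whole stream as UTF-16LE text.
    static String readUtf16Le(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, "UTF-16LE")) {
            char[] buf = new char[255];
            int n;
            // read() returns the number of chars read, or -1 at end of stream
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // "Hi" in UTF-16LE: each ASCII character is followed by a 00 byte
        byte[] sample = {'H', 0, 'i', 0};
        System.out.println(readUtf16Le(new ByteArrayInputStream(sample))); // Hi
    }
}
```

Appending only the `n` chars actually read avoids carrying unused buffer slots into the resulting string.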

RoToRa
  • Not sure, but I tried applying the BOM removal code and it seemed to make it worse! I suppose the easiest solution is to strip out all those weird A characters. Unfortunately, I don't know the Unicode char to do so. – RichW Mar 28 '11 at 10:55
  • Stripping out those characters wouldn't be solving the problem. Are you sure it's a UTF-8 file? Can you look at the file in a hex editor and post a screen shot or the hex codes of the first few bytes? – RoToRa Mar 28 '11 at 11:04
  • All I know is that it's Unicode. I tried UTF-16 and it was completely unreadable; it was just made up of lots of dodgy characters. As requested, I've output the hex codes for each character (see the original post). It appears that there is a 0 between every character. – RichW Mar 28 '11 at 11:17
  • A single `0` doesn't make much sense between the characters. If there really were a 0 byte, it would be `00`. The problem with your output is that it has already been processed by (possibly wrong) Java code, so a look at it in an "independent" hex editor would be better... – RoToRa Mar 28 '11 at 11:28
  • It wouldn't surprise me if the original file is incorrect - the app that produces it has a lot of flaws with the SDK on Android. I've updated the original post with the output from a hex editor. – RichW Mar 28 '11 at 12:19
  • Thanks. That looks like little-endian UTF-16 to me. Try the charset "UTF-16LE". – RoToRa Mar 28 '11 at 12:24
  • Haha fantastic, worked perfectly! I really need to learn more about encoding and how to tell which one is which. If you create a separate answer I'll select that as the solution :) – RichW Mar 28 '11 at 12:32

The file you show in the hex editor is not UTF-8 encoded; it looks more like UTF-16. This means you must specify UTF-16 as the encoding in your code (probably the UTF-16LE variant).

If it were UTF-8 encoded, then it would represent all characters representable in ASCII using just a single byte.
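To see the difference, here is a small sketch that prints the bytes of the same ASCII text in both encodings. The class name `EncodingBytes` and the `hex` helper are made up for illustration.

```java
import java.io.UnsupportedEncodingException;

public class EncodingBytes {
    // Hypothetical helper: formats bytes as space-separated two-digit hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // UTF-8: one byte per ASCII character
        System.out.println(hex("Hi".getBytes("UTF-8")));    // 48 69
        // UTF-16LE: two bytes per character, low byte first
        System.out.println(hex("Hi".getBytes("UTF-16LE"))); // 48 00 69 00
    }
}
```

A `00` after every ASCII character in a hex dump is therefore a strong hint of UTF-16LE rather than UTF-8.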

Joachim Sauer
  • Interesting tip, thanks for that. I'll try creating different files with different types of encoding. I guess that is the easiest way to learn the difference. – RichW Mar 28 '11 at 12:37