
I'm writing a project which parses a UTF-8 encoded file.

I'm doing it this way

ArrayList<String> al = new ArrayList<>();
BufferedReader bufferedReader = new BufferedReader(
        new InputStreamReader(new FileInputStream(filename), "UTF8"));

String line = null;

while ((line = bufferedReader.readLine()) != null)
{

    al.add(line);
}

return al;

The strange thing is that it reads the file properly when I run it in IntelliJ, but not when I run it through java -jar (it gives me garbage characters instead of the UTF-8 text).

What can I do to either

  1. Run my Java through java -jar in the same environment as IntelliJ, or
  2. Fix my code so that it reads UTF-8 into the string?
asked by Charles Shiller

2 Answers


I think what is going on here is that your terminal just isn't set up correctly for your default encoding. Basically, if your program runs correctly, it's grabbing the UTF-8 bytes, storing them as Java strings, and then outputting them to the terminal in whatever the default encoding scheme is. To find out what your default encoding scheme is, see this question. Then you need to ensure that the terminal you run your java -jar command from is compatible with it. For example, see my terminal settings/preferences on my Mac.

[screenshot: Mac Terminal settings for UTF-8]
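
To make that concrete, here is a minimal sketch (not part of the original answer; the class name and sample string are just illustrations) that prints the JVM's default charset and then writes a line through a PrintStream with an explicit UTF-8 encoding:

import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingCheck {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // The default charset is what System.out uses when no encoding is
        // configured explicitly; IntelliJ and a bare terminal can differ here.
        System.out.println("Default charset: " + Charset.defaultCharset());

        // Writing through a PrintStream with an explicit UTF-8 encoding
        // removes the dependency on the platform default.
        PrintStream utf8Out = new PrintStream(System.out, true, StandardCharsets.UTF_8.name());
        utf8Out.println("UTF-8 sample: é ü 中");
    }
}

Running the jar with java -Dfile.encoding=UTF-8 -jar app.jar (app.jar standing in for your jar) is also commonly suggested as a way to align the default encoding, though it is not an officially supported switch on every JVM; writing the code against explicit charsets is the more portable fix.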

answered by entpnerd
  • Imagine a piece of software you're shipping to your clients. Your software has a bug - it prints incomprehensible characters to the output stream. But instead of fixing it directly in your software (like I suggest in my answer), you release a public note telling your customers to change their terminal settings. What if some clients run a system without a GUI? How do they change the settings there? Would you release a public note describing how to change settings for every possible platform? That doesn't seem like a solution to me. – Kevin Kopf Mar 31 '16 at 04:36
  • I guess what I'm saying is that there probably isn't a bug in your code. Assuming the code runs, the snippet of code you have above should be reading and storing UTF-8 characters correctly in strings. I was surmising you were just doing `System.out.println()`. How exactly are you outputting this data? – entpnerd Mar 31 '16 at 04:43
  • Except that it *also* doesn't output correctly to a file either – Charles Shiller Mar 31 '16 at 06:39
  • Charles, assuming that you have the same problem when you output to file as you do to terminal, what happens when you run the command `cat foo.txt | xxd`, and what special UTF-8 characters do you have in that file? Basically, I'm suspicious that the code is fine, but whatever UI you are using to view these UTF-8 characters isn't configured correctly. – entpnerd Mar 31 '16 at 06:44
  • @Nordenheim - No, it's the other way round; entpnerd has a good solution - one should write their code so it is agnostic to the environment and allow the JVM to work out the correct encoding according to the environment. You, on the other hand, have suggested fixing the input to UTF-8. How would that work on Windows, where a user has saved a file as "cp1252", or for me, who has a `locale` of `en_GB.ISO8859-1` and an encoding in my terminal to match? entpnerd has highlighted the common situation where people's terminal (be it Terminal, iTerm, PuTTY, etc.) does not match the locale. – Alastair McCormack Mar 31 '16 at 13:14
  • When we know what environment @CharlesShiller is working in, then we'll know how appropriate this answer is. – Alastair McCormack Mar 31 '16 at 13:40

Oracle docs give a pretty straightforward answer about Charset:

Standard charsets

Every implementation of the Java platform is required to support the following standard charsets. Consult the release documentation for your implementation to see if any other charsets are supported. The behavior of such optional charsets may differ between implementations.

...

UTF-8

Eight-bit UCS Transformation Format

So you should use `new InputStreamReader(new FileInputStream(filename), "UTF-8")`.
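
As a rough sketch of that change (reusing the filename variable from the question and the NIO helpers available since Java 7; the class and method names here are chosen just for illustration):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class Utf8FileReader {
    // Mirrors the loop from the question, but names the charset via the
    // StandardCharsets constant instead of a string literal.
    static List<String> readLines(String filename) throws IOException {
        List<String> al = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(
                Paths.get(filename), StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                al.add(line);
            }
        }
        return al;
    }
}

If you don't need the streaming loop, Files.readAllLines(Paths.get(filename), StandardCharsets.UTF_8) does the same thing in one call.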

answered by Kevin Kopf
  • I thought about that too but if that were the case, the `InputStreamReader` constructor would throw an `UnsupportedEncodingException`. – entpnerd Mar 31 '16 at 04:09
  • Yes. He is using the string constructor, not the charset constructor: https://docs.oracle.com/javase/7/docs/api/java/io/InputStreamReader.html#InputStreamReader(java.io.InputStream,%20java.lang.String) – entpnerd Mar 31 '16 at 04:17
  • @entpnerd yes, I misread it, that's why I deleted my comment. It might be that the OP catches exceptions and discards them silently, who knows – Kevin Kopf Mar 31 '16 at 04:19
  • 1
    How is this answer different to the OP's question? – Alastair McCormack Mar 31 '16 at 13:38