
I have a problem with UTF-8 encoding in Java. I have a UTF-8 encoded .txt file, and I have checked in Notepad++ that the file really is UTF-8 encoded. When I read the file, the special letters are not shown correctly.

I use the following piece of code:

    try {
        Scanner sc = new Scanner(new FileInputStream("file.txt"), "UTF-8");
        String str;

        while (sc.hasNextLine()) {
            str = sc.nextLine();
            roadNames.add(str);
            System.out.println(str);
        }

        sc.close();
    } catch (IOException e1) {
        System.out.println("The file was not found....");
    }

The special letters are shown correctly in Eclipse, where I have set the default encoding to UTF-8, but not when I run my generated jar file.

The only thing that actually works for me is a .bat file that runs "java -Dfile.encoding=utf-8 -jar executable.jar", but I do not think that is a good solution.

Furthermore, this also works:

PrintStream out = new PrintStream(System.out, true, "UTF-8"); 
out.println(str);

Update

When I say

The special letters are not shown correctly

I mean that System.out.println prints a string in which the special letters are replaced, for example ├à instead of å.

It turns out that

PrintStream out = new PrintStream(System.out, true, "UTF-8"); 
out.println(str);

does not work after all. Sorry about that.

The real problem is not printing the contents of the text document to the console. Each line in the text document contains a name, and each name is added to an ArrayList. I then have a JTextField which, when I begin typing in it, tries to autocomplete what I typed by searching for the best matching name in the ArrayList. This would work perfectly if it were not for the encoding problem: the special letters inside the JTextField are not shown correctly. They are only shown correctly when I use the -Dfile.encoding=utf-8 argument.

user2403175
  • What do you mean by "It shows the special letters correctly"? You see your non-Latin symbol in Eclipse, but when you open the .bat file in Windows you see a ... ? If you're not seeing the correct character, it's probably because you don't have the correct language pack installed on Windows. This isn't a Java question. If the bytes in the file are correct, then it has nothing to do with Java – Christian Bongiorno May 20 '13 at 20:58
  • Where are you trying to show them? "but the special letters are not shown correctly." – Paulo Bu May 20 '13 at 20:59
  • If you're talking about this line: `System.out.println(str);` then the problem is that your console can't show those characters. Other than that, you are doing everything right. – jlordo May 20 '13 at 21:02
  • The command prompt does not display characters like æ, ø, å correctly, but shows some weird symbols instead; é is not shown correctly either. This only works in Eclipse, and not even when I run the jar file with the -Dfile.encoding=utf-8 argument. I have a JTextField which contains one of the strings inserted into the roadNames ArrayList. The JTextField shows æ, ø, å correctly when I use the -Dfile.encoding=utf-8 argument, but not otherwise – user2403175 May 20 '13 at 21:16
  • Are you reading the `roadNames` list from a file? If so, what do you see when you run the command `type file.txt` in your `cmd.exe` console? It sounds like your Windows settings are to blame; you can get more background [here.](http://stackoverflow.com/questions/1259084/what-encoding-code-page-is-cmd-exe-using) – erickson May 21 '13 at 17:59
  • Okay, please see my update. Looks like your problem is actually decoding, not encoding. – erickson May 22 '13 at 20:41

2 Answers


Java will use the platform default encoding, unless you specify something else.

It sounds like your platform default (a Windows setting) is not UTF-8, so in the cases where you don't specify the file.encoding property, or provide the encoding to the PrintStream constructor, the default encoding is used. In this case, when a character is found that cannot be encoded, that encoder's replacement character is used instead. This is usually '�' or '?'.

The operating system is indicating that it may not be able to display some of the characters you wish to print. You can ignore that hint, and hope for the best, or you can replace the troublesome characters with something that is guaranteed to display. The default is to replace; you have to be explicit if you want to use the more risky approach.
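To make the replacement behaviour concrete, here is a minimal sketch. The class name `EncodeDemo` and the helper `encodeAscii` are made up for illustration, and US-ASCII merely stands in for "a platform encoding that cannot represent å"; your actual Windows code page will differ. Note also that on Java 18 and later the default charset is UTF-8 unless overridden, so the first line printed depends on your JVM version as well as your platform.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodeDemo {
    // Hypothetical helper: encode with a charset that has no mapping for å.
    static byte[] encodeAscii(String s) {
        // String.getBytes(Charset) substitutes the charset's replacement
        // bytes for every character it cannot encode.
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // The charset Java falls back to when none is specified.
        System.out.println("Default charset: " + Charset.defaultCharset().name());

        // "\u00e5" is å, written as an escape so that the source file's own
        // encoding cannot interfere with the demonstration.
        byte[] encoded = encodeAscii("\u00e5");
        System.out.println((char) encoded[0]); // prints "?"
    }
}
```

Running this from both Eclipse and the jar should show whether the two environments really use different defaults.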


Update: Based on the information provided in updates to the original question, it sounds like the problem lies in reading the file, not its output.

Using the platform default encoding is an exceptional case. The general pattern you should follow is to specify the encoding explicitly each time you are decoding a sequence of bytes to a string of characters. The encoding is inherent to the stream you are reading, and generally independent of the system that your code happens to be running on. Exceptions would be when you are reading from the console, or similar. Otherwise, there should be some metadata or convention that specifies the encoding, like an HTTP header, an attribute embedded in the file, or some standard that requires a particular encoding.
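A small sketch of what goes wrong when bytes are decoded with the wrong charset. The class name `DecodeDemo` and the helper `decode` are hypothetical, and ISO-8859-1 is used only as a stand-in for a non-UTF-8 platform default, so the garbage characters you see on Windows will differ from the ones here.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeDemo {
    // Hypothetical helper: turn raw bytes into text using an explicit charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // å encoded as UTF-8 is the two bytes 0xC3 0xA5.
        byte[] utf8Bytes = "\u00e5".getBytes(StandardCharsets.UTF_8);

        // Decoding with the charset the bytes were written in recovers å.
        String right = decode(utf8Bytes, StandardCharsets.UTF_8);

        // Decoding with a single-byte charset turns each byte into its own
        // character, which is the classic two-characters-per-letter garbage.
        String wrong = decode(utf8Bytes, StandardCharsets.ISO_8859_1);

        System.out.println(right.length()); // prints 1
        System.out.println(wrong.length()); // prints 2
    }
}
```

The bytes are identical in both calls; only the charset passed to the decoder changes the result, which is why it should be stated explicitly.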

Here's how to read your road names from a UTF-8–encoded file:

Set<String> roadNames = new TreeSet<>();
try (InputStream bytes = new FileInputStream("file.txt")) {
  /* See how I'm specifying the UTF-8 encoding explicitly? */
  Reader chars = new InputStreamReader(bytes, StandardCharsets.UTF_8);
  BufferedReader lines = new BufferedReader(chars);
  while (true) {
    String line = lines.readLine();
    if (line == null)
      break;
    roadNames.add(line);
  }
}
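If you want to try the same idea as a complete program, the following is one possible self-contained sketch. The class `RoadNameReader`, the method `readRoadNames`, the temporary file, and the two sample names are all made up for the demonstration, and it collects into a `List` to keep file order, which is an assumption about what the autocomplete needs.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class RoadNameReader {
    // Reads every line of the file, decoding the bytes as UTF-8 no matter
    // what the platform default charset happens to be.
    static List<String> readRoadNames(Path file) throws IOException {
        List<String> names = new ArrayList<>();
        try (InputStream bytes = Files.newInputStream(file);
             BufferedReader lines = new BufferedReader(
                     new InputStreamReader(bytes, StandardCharsets.UTF_8))) {
            String line;
            while ((line = lines.readLine()) != null) {
                names.add(line);
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Sample data written as UTF-8 to a temporary file (illustrative only).
        Path file = Files.createTempFile("roads", ".txt");
        try {
            String content = "Sk\u00e5negade\n\u00c6bleg\u00e5rden\n";
            Files.write(file, content.getBytes(StandardCharsets.UTF_8));

            List<String> names = readRoadNames(file);
            // Compare against escaped literals rather than printing, so the
            // console encoding cannot affect the check.
            System.out.println(names.get(0).equals("Sk\u00e5negade")); // prints true
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

Strings read this way hold the right characters internally, so a JTextField should display them correctly even if the console still cannot.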
erickson

I had the same problem. Use Charset.forName("cp866") and it should help.

// cmd is presumably a java.lang.Process whose output is encoded in cp866
BufferedReader brI = new BufferedReader(new InputStreamReader(cmd.getInputStream(), Charset.forName("cp866")));
String result;
while ((result = brI.readLine()) != null) {
    System.out.println(result);
}
ilya_kas