15

For some reason a String that is assigned the letter å by using the Scanner class does not equal a String that is assigned å by using the "normal" way: String a = "å" - Why is that?

import java.util.*;

public class UTF8Test {
    public static void main(String[] args) {

        String[] Norge = {"løk", "hår", "vår", "sær", "søt"};

        Scanner input = new Scanner(System.in);

        String test = input.nextLine();  // I enter løk here
        System.out.println(test);
        System.out.println(Norge[0]);

        for (int i = 0; i < Norge.length; i++) {
            if (Norge[i].equals(test)) {
                System.out.println("YES!!");
            }
        }
    }
}

The compiler will show this:

løk

løk

l├©k

Sing Sandibar
  • Where exactly did you ensure that the characters from `System.in` are fed and interpreted using UTF-8? I'm not seeing that anywhere in the code. Thus, your code is assuming that the platform's default charset (as identified by `Charset.defaultCharset()`) is already UTF-8. Is that true? – BalusC Nov 13 '13 at 15:19
  • @BalusC I have not ensured that the characters from `System.in` are interpreted using UTF-8. How do I do that? – Sing Sandibar Nov 13 '13 at 15:22
  • Also, you say "The compiler will show this" with 3 lines, but the output you list doesn't seem to match what your code does. – LordOfThePigs Nov 13 '13 at 15:22
  • is it System.out.println(test), or System.out.println(Norge[0]) which is printing the correct string? – LordOfThePigs Nov 13 '13 at 15:23
  • 2
    Depends on the runtime environment. The `├©` as mojibaked form of `ø` suggests that the original environment is using CP850 instead of UTF-8. The CP850 is by default used in Windows command console. This suggests that you were running this in Windows command console instead of in an UTF-8 capable IDE like Eclipse. You should be able to confirm this by just printing/examining the outcome of `Charset.defaultCharset()`. – BalusC Nov 13 '13 at 15:25
  • @LordOfThePigs `test` is showing the correct String. `Norge[0]` is showing the one with the messed up letter. And the output was pasted from the command line window, and is exactly what is showing. – Sing Sandibar Nov 13 '13 at 15:26
  • @SingSandibar I see, so the first line is actually your input. Seems to me you're not giving the compiler the correct encoding as input. – LordOfThePigs Nov 13 '13 at 15:28
  • 1
    The compiler doesn't play a role during runtime/input/output at all. The compiler only plays a role during turning `.java` files into `.class` files. – BalusC Nov 13 '13 at 15:39
  • Maybe someone should actually try examining the `.class` file, to see how the compiler is representing the strings in its output. – AJMansfield Nov 13 '13 at 15:54
  • This is almost certainly some sort of code page issue. But sorting out such issues can be a real challenge at times. – Hot Licks Nov 13 '13 at 15:54
  • Actually, I'm installing a hex editor into eclipse right now, to test this. – AJMansfield Nov 13 '13 at 15:56
  • Guys, did nobody read my comment that the `├©` as CP850-mojibaked form of `ø` suggests that the OP is using Windows command prompt to run `javac` on an UTF-8 saved source code file? Messing around in Eclipse let alone in hex editors (wtf?) won't give you clues on that. – BalusC Nov 13 '13 at 16:05
  • Well, assuming you run javac from the command prompt, using a hex editor on the `.class` file can tell you which characters the compiler put there. Compiling with eclipse may not be very relevant though. – LordOfThePigs Nov 13 '13 at 16:13
  • I am from Norway and I absolutely hate cmd.exe for this specific reason. There is no workaround: even after changing from CP850 to 65001 (with UTF-8 source), it still spews gibberish. – arynaq Nov 13 '13 at 16:48

5 Answers

7

Provided that your sole requirement is being able to use UTF-8 everywhere, as the UTF8Test class name indicates, your main mistake is that you're using the Windows command console to compile and run your Java program. The ├© as the mojibaked form of ø strongly suggests that your Java source code file was compiled using the CP850 encoding. As evidence, run this in a UTF-8 capable environment:

System.out.println(new String("ø".getBytes("UTF-8"), "CP850"));

This prints ├©. This in turn strongly suggests that you were using Windows command console to compile your Java source code file as that's currently the only commonly used environment which uses CP850 by default. However, the Windows command console is not UTF-8 capable.

When you save (convert from chars to bytes) the source code file using UTF-8 encoding in your text editor, the ø character is turned into the bytes 0xC3 and 0xB8 (as evidence, see the "UTF-8 (hex)" entry in the U+00F8 character info). When you run javac UTF8Test.java, the UTF-8 saved source code file is basically read (converted from bytes to characters) using CP850 encoding. In this encoding, the bytes 0xC3 and 0xB8 represent the characters ├ and © (as evidence, see the CP850 codepage layout). This fully explains your initial problem.
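The byte-level round trip described above can be reproduced in a few lines. This is only a sketch: the class and method names (MojibakeDemo, misdecode, hex) are made up for illustration, and it assumes the JDK in use ships the IBM850 charset (the Java name for CP850), which is not guaranteed by the platform specification. The ø is written as a Unicode escape so that the demo itself does not depend on how the source file is saved.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {

    // Encode text with one charset, then (wrongly) decode the bytes with another.
    public static String misdecode(String text, Charset savedAs, Charset readAs) {
        byte[] bytes = text.getBytes(savedAs);
        return new String(bytes, readAs);
    }

    // Render bytes as space-separated upper-case hex pairs.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String oSlash = "\u00F8"; // ø, escaped to stay independent of the source encoding

        // What the editor writes to disk when saving as UTF-8.
        System.out.println(hex(oSlash.getBytes(StandardCharsets.UTF_8))); // prints "C3 B8"

        // Assumption: IBM850 (CP850) is available in this JDK.
        Charset cp850 = Charset.forName("IBM850");
        String garbled = misdecode(oSlash, StandardCharsets.UTF_8, cp850);

        // Print code points rather than glyphs, so the console encoding cannot interfere.
        for (char c : garbled.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c); // U+251C (├), then U+00A9 (©)
        }
    }
}
```

Printing code points instead of the characters themselves is deliberate: it keeps the result readable even on a console that cannot display the glyphs.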

True, you can instruct javac to read the source code file using UTF-8 by the -encoding UTF-8 argument. However, the Windows command console on its own does not support UTF-8 flavored input and output at all. When you recompile using -encoding UTF-8, you would still get mojibaked output because the command console can't properly represent UTF-8 output. I tried it here and I got a degree symbol instead:

løk
l°k

This problem is not solvable if you intend to use UTF-8 everywhere and want to stick to the Windows command console as input/output environment. Basically, you need a UTF-8 capable input/output environment. Decent IDEs like Eclipse and Netbeans are such environments. Or, if you intend to run it as a UTF-8 capable standalone program, a Swing UI should be preferred over a GUI-less console program.
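In an environment that does handle the chosen encoding, the charset used to decode input can be named explicitly instead of being left to the platform default. The following is a minimal sketch, not a fix for the Windows console: the class name ExplicitCharsetInput and the helper matches are hypothetical, and the typed input is simulated with a byte array. In the real program the stream would be System.in and the charset would have to be whatever the console actually sends.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

public class ExplicitCharsetInput {

    // Escapes keep the literals correct regardless of how this file is saved or compiled.
    public static final String[] NORGE = {"l\u00F8k", "h\u00E5r", "v\u00E5r", "s\u00E6r", "s\u00F8t"};

    // Reads one line from the stream, decoding it with the named charset,
    // and reports whether it equals one of the known words.
    public static boolean matches(InputStream in, String charsetName) {
        Scanner input = new Scanner(in, charsetName);
        if (!input.hasNextLine()) {
            return false;
        }
        String test = input.nextLine();
        for (String word : NORGE) {
            if (word.equals(test)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Simulates a terminal that sends "løk" encoded as UTF-8.
        byte[] typed = "l\u00F8k\n".getBytes(StandardCharsets.UTF_8);

        // Decoder agrees with the bytes: the comparison succeeds.
        System.out.println(matches(new ByteArrayInputStream(typed), "UTF-8"));      // true

        // Decoder disagrees with the bytes: the comparison fails.
        System.out.println(matches(new ByteArrayInputStream(typed), "ISO-8859-1")); // false
    }
}
```

The second call shows the same kind of failure as in the question, only on the input side: the bytes are fine, but they are decoded with the wrong charset, so equals() returns false.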

BalusC
  • "If you intend to run it as a standalone program, using a Swing UI should be preferred over a GUI-less console program." Not so. Anything that can be a gui-less console program, should. If you want GUI, then write a second program that feeds GUI input to API exposed by the first program. – AJMansfield Nov 13 '13 at 16:51
  • Much more detailed and understandable answer than mine. It still boils down to the same, but I like yours better :-) – LordOfThePigs Nov 14 '13 at 08:20
  • @Ingo: No, Windows codepage 65001 doesn't generally work for console windows. A console window may arbitrarily swallow output that follows undesired characters. And input doesn't work at all. – Cheers and hth. - Alf Feb 14 '15 at 10:08
4

If you want to have a string literal with a special character, you can try using a Unicode escape:

String [] Norge = {"l\u00F8k", "h\u00E5r", "v\u00E5r", "s\u00E6r", "s\u00F8t"};

While it is not wrong to include special characters in source code (at least in Java), it can in some cases cause problems with poorly configured editors, compilers, or terminals. Personally, I steer clear of using special characters at all if I can.

Incidentally, you can also use Unicode escapes elsewhere in Java source code, including javadoc comments, and class, method, and variable names.

If you are compiling from the command line, you can configure the compiler to accept UTF-8 by using the -encoding option with UTF-8 as its parameter. Like so:

javac -encoding UTF-8 ...

You may also find this question useful: Special Character in Java


You might consider externalizing the strings as an alternative way to solve the problem. Eclipse provides a way to do this automatically, but it basically just takes all the literal strings, puts them in a separate file, and reads from that file to get the appropriate string. This also allows you to create a translation of the program by making a different file with translations of all the strings, or to reconfigure application messages without having to recompile.
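As a rough illustration of externalized strings (not what Eclipse generates), the words could live in a properties file that is read through a Reader with an explicit charset. The keys (word.onion, word.hair) and the class name ExternalStrings are invented for this sketch, and a StringReader stands in for the file so the example is self-contained.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class ExternalStrings {

    // Loads key=value pairs from a character stream. Because a Reader is used,
    // the caller decides the charset when opening the underlying file.
    public static Properties load(Reader reader) {
        Properties props = new Properties();
        try {
            props.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // Stand-in for the contents of a UTF-8 saved file. With a real file, one option is
        // Files.newBufferedReader(path, StandardCharsets.UTF_8).
        String fileContents = "word.onion=l\u00F8k\nword.hair=h\u00E5r\n";

        Properties words = load(new StringReader(fileContents));
        System.out.println("l\u00F8k".equals(words.getProperty("word.onion"))); // true
    }
}
```

With this arrangement the source file contains only ASCII, so the compiler's source encoding no longer matters for these strings. The encoding of the properties file still has to match the charset passed when opening it.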


EDIT: I just tried compiling and running it myself (in Eclipse), and I did not have the problem you mention. It is therefore likely an issue with your particular setup.

When I reconfigured it to compile the code as US-ASCII, it output l?k both times.

When I reconfigured it to compile the code as UTF-8, the output was løk and løk.

When I compiled it as UTF-16, the output was þÿ l ø k and løk; however, I could not copy the blank spaces in þÿ l ø k from the terminal: it would let me copy the first two, but leave off the rest. This is probably related to the issue you were having. They could be control characters that are messing it up in your case.
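One plausible reading of the þÿ l ø k output, offered here as an assumption rather than something verified on that setup, is that UTF-16 bytes were displayed by something expecting a single-byte encoding. Java's UTF-16 encoder writes a big-endian byte order mark (0xFE 0xFF) followed by two bytes per character, and in ISO-8859-1 those two bytes are þ and ÿ. The sketch below (class name Utf16BomDemo is made up) shows the resulting code points.

```java
import java.nio.charset.StandardCharsets;

public class Utf16BomDemo {

    // Encodes text as UTF-16 (Java adds a big-endian BOM) and then decodes
    // the raw bytes one-to-one as ISO-8859-1 characters.
    public static String asLatin1(String text) {
        byte[] utf16 = text.getBytes(StandardCharsets.UTF_16);
        return new String(utf16, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String garbled = asLatin1("l\u00F8k");

        // Prints: U+00FE U+00FF U+0000 U+006C U+0000 U+00F8 U+0000 U+006B
        StringBuilder line = new StringBuilder();
        for (char c : garbled.toCharArray()) {
            if (line.length() > 0) {
                line.append(' ');
            }
            line.append(String.format("U+%04X", (int) c));
        }
        System.out.println(line);
    }
}
```

If that reading is right, the "blank spaces" would be NUL characters (U+0000), which many terminals display as gaps and refuse to copy.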

Community
AJMansfield
  • 2
    OP's concrete problem isn't caused by a wrongly saved source code file. Besides, this is not 1990 anymore. Modern editors save source code files using UTF-8. You're still not answering the concrete problem. – BalusC Nov 13 '13 at 15:36
  • @BalusC it may be caused by that, you never know. String externalization is still good though. – AJMansfield Nov 13 '13 at 15:40
  • @BalusC read http://stackoverflow.com/questions/12445635/special-character-in-java. – AJMansfield Nov 13 '13 at 15:43
3

By default, on Windows, the Java compiler interprets all of its source files using the "platform default encoding". Depending on the environment in which you run the compiler, this may be ISO-8859-1, CP1252, UTF-8, or really any other encoding.
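To see what the platform default actually is on a given machine, something like the following can be run in the same environment as the compiler. The class name PlatformCharset is made up; Charset.defaultCharset() and the file.encoding system property are standard, though the value reported depends entirely on the JVM version and environment.

```java
import java.nio.charset.Charset;

public class PlatformCharset {

    // Summarises the charset the JVM falls back to when none is specified.
    public static String describe() {
        return "default charset: " + Charset.defaultCharset().name()
                + ", file.encoding: " + System.getProperty("file.encoding");
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running it once from the command prompt and once from the IDE is a quick way to check whether the two environments disagree.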

If the editor you are using is actually encoding your Java source files using UTF-8, but the compiler is reading those source files using another encoding, then the contents of all your hardcoded strings may be garbled (as you have experienced). To fix this problem, either make sure you save your Java source file in the "platform default encoding", or set up your Java compiler to interpret the source files as UTF-8.

Try calling your compiler with javac -encoding UTF-8 UTF8Test.java. If necessary, replace UTF-8 with whatever encoding your editor uses to save your source file.

LordOfThePigs
  • The ISO-8859-1-mojibaked variant of `ø` is `Ã¸`. However, the OP got a `├©`. So your answer is basically wrong. Evidence: `System.out.println(new String("ø".getBytes("UTF-8"), "ISO-8859-1"));` (do this in a UTF-8 capable environment!) – BalusC Nov 13 '13 at 15:23
  • well, if this guy is using the norwegian codepage, he may actually be using ISO-8859-4 or ISO-8859-10. I'm not sure how these would be translated, but I still think its possible. – LordOfThePigs Nov 13 '13 at 15:26
  • Sorry no, any ISO-8859-X mojibaked variant of a 2-byte UTF-8 character starts with `Ã` (0xC3) – BalusC Nov 13 '13 at 15:28
  • Oh, so you think the encoding is wrong on the other side? If that was the case, wouldn't Norge[0] be the one that prints correctly? I believe System.out does use the default platform charset, doesn't it? Or is it the Windows command prompt which is particularly stupid and uses another encoding than the rest of the system, and can't handle whatever Java prints to it? – LordOfThePigs Nov 13 '13 at 15:32
  • @BalusC: I've edited my answer a little bit, to remove the reference to the wrong encoding. seems more correct now? The concept is still the same though. – LordOfThePigs Nov 13 '13 at 15:43
1

If you are working in Eclipse, change your console encoding: Run menu > Run Configurations... > Common tab (right-hand side) > Encoding panel > select Other = UTF-8.



Somnath Kadam
-1

I had an issue with displaying Norwegian characters. Try using the encoding ISO 8859-10.

Ravindra S. Patil