I have a very simple bit of Scala code

 var str = "≤"
 for (ch <- str) { printf("%d, %x", ch.toInt, ch.toInt); println() }
 println()
 str = "\u2264"
 for (ch <- str) { printf("%d, %x", ch.toInt, ch.toInt); println() }

In case that doesn't show properly on your browser, the first string contains one character, between double-quotes, which is the less-or-equal-to sign U+2264.

The program outputs

8218, 201a
226, e2
167, a7

8804, 2264

Clearly the first string is 3 characters long at run time, not 1 character long as it is in the source file.

The source file is stored in UTF-8. A hex dump shows that it is encoded properly, the first string being 22 E2 89 A4 22. I'm using Eclipse and the Scala plugin for Eclipse.

  • Does the Scala compiler accept input files encoded in UTF-8?
  • If so, why does my program produce unexpected results?
Theodore Norvell
  • “Does the Scala compiler work with UTF-8 encoded source files?” The answer is yes. – Randall Schulz Apr 22 '14 at 16:09
  • Curiously, if I tell Eclipse to change the encoding to MacRoman, it displays the string as three characters. If I then edit it back to one and save, the string is saved as one character: B2. Compile, run. It works! So it seems that, if the file is encoded in UTF-8, Eclipse is failing to inform the Scala compiler that this is so, and the Scala compiler is proceeding as if the file were in some other encoding. This explains everything except the specific 3 characters. Why 201a e2 a7 rather than e2 89 a4? I don't really care about that. I do want to know how to tell Scala what encoding to assume. – Theodore Norvell Apr 22 '14 at 16:57
  • Use the `-encoding` option when compiling your code on the command line with `scalac` to specify the encoding of the source file. For example: `scalac -encoding UTF-8 MyProgram.scala` – Jesper Apr 22 '14 at 17:31
  • Thanks @Jesper. That provided a good solution. As a bonus, when I do this, my MacRoman-encoded files that contain non-ASCII characters give a compile-time error. So I can't accidentally use the wrong encoding, unless all the characters are between 0000 and 00FF -- in which case it doesn't matter. – Theodore Norvell Apr 22 '14 at 17:46

3 Answers


To answer my own questions:

Does the Scala compiler work with UTF-8 encoded files?

Yes, but only if it knows they are UTF-8 encoded. In the absence of any other evidence, it uses Java's file.encoding property. (Thanks to @AndreasNeumann for this part of the answer.)

Why did my program not behave as I expected?

Because my file.encoding property was set to MacRoman. Even though I had told Eclipse that the file is UTF-8, this information was not communicated to the Scala compiler. Thus the compiler interpreted the 3-byte sequence E2 89 A4 as three characters according to the MacRoman encoding: a single low-9 quotation mark (which looks a lot like a comma), an "a" with circumflex, and a section sign. The Unicode code points for these three characters are U+201A, U+00E2, and U+00A7, which explains the output of my program.
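The misdecoding described above can be reproduced outside the compiler. The following is a minimal sketch, not part of the original answer: the object name `DecodeDemo` and its helpers are made up for illustration, and it assumes the JVM ships the extended charset `x-MacRoman` (the code checks for it before using it).

```scala
import java.nio.charset.{Charset, StandardCharsets}

object DecodeDemo {
  // The three bytes from the hex dump: the UTF-8 encoding of U+2264.
  val bytes: Array[Byte] = Array(0xE2, 0x89, 0xA4).map(_.toByte)

  // Decode the bytes with the given charset and return each char as an Int.
  def codeUnits(cs: Charset): List[Int] =
    new String(bytes, cs).toList.map(_.toInt)

  // Render the values as space-separated lowercase hex.
  def hex(units: List[Int]): String =
    units.map(u => f"$u%x").mkString(" ")

  def main(args: Array[String]): Unit = {
    println("UTF-8:    " + hex(codeUnits(StandardCharsets.UTF_8)))
    // "x-MacRoman" is an extended charset; guard in case this JVM lacks it.
    if (Charset.isSupported("x-MacRoman"))
      println("MacRoman: " + hex(codeUnits(Charset.forName("x-MacRoman"))))
  }
}
```

Decoded as UTF-8 the bytes give the single value 2264; decoded as MacRoman they give 201a e2 a7, matching the two halves of the program output in the question.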

How do you fix the problem?

On the scalac command line, use the option -encoding UTF-8. In Eclipse you can add this option in the preferences (options) for the Scala plugin. (Thanks to @Jesper for this part of the answer.) You can also use the -D option, either on the scalac command line or via the JAVA_OPTS environment variable, to set the file.encoding property. (See the answer of @AndreasNeumann for details.)

If you use the Scala IDE for Eclipse, there are at least three things you can do.

  • One is to set the default encoding for all your workspaces under General >> Workspace in Eclipse's global preferences (or options), as shown in Iulian Dragos's answer.
  • In the project properties (right-click on the project in the Package Explorer and select Properties), under the Resource preferences, select UTF-8 as the Text file encoding.
  • Finally, you can add -encoding UTF-8 under additional command line parameters under Compiler >> Scala in the preferences (or options). You can set this as a global preference (or option) or as a project specific property setting.
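If the project is also built outside Eclipse, the same compiler flag can be set in the build definition. A minimal sketch, assuming an sbt build (sbt is not mentioned anywhere in the question, so this is an assumption about the setup):

```scala
// build.sbt (assumes an sbt build; not part of the original Eclipse setup)
// Pass the same flag that works on the scalac command line.
scalacOptions ++= Seq("-encoding", "UTF-8")
```

This keeps the source encoding explicit regardless of the file.encoding default on the machine doing the build.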
Theodore Norvell
  • You should set the default encoding in Workspace preferences. The IDE will add -encoding anyway, so now you're at the mercy of command line parsing, and which -encoding takes precedence – Iulian Dragos Aug 04 '14 at 12:49
  • @IulianDragos Thanks. I thought I'd tried that, but I guess not at the global level. Is there a project specific way to do this? In any case I'll edit my answer. – Theodore Norvell Aug 04 '14 at 21:46
  • Ok. Found a project specific solution. Will edit again. – Theodore Norvell Aug 04 '14 at 21:58

Yes, Scala fully supports UTF-8.

I can't reproduce your results. MacOS X, Java 7, Scala 2.10.4.

Check the file encoding of your system:

scala> System.getProperty("file.encoding")
res0: String = UTF-8
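The REPL may be started with different JVM options than an IDE, so the same check can be run as a small program from the environment in question. A minimal sketch (the object name `ShowEncoding` is made up for illustration); note that it reports the defaults of the JVM that runs it, which is not necessarily the JVM the compiler runs in:

```scala
import java.nio.charset.Charset

object ShowEncoding {
  // The system property that the check above reads; wrapped in Option in case it is unset.
  def fileEncodingProperty: Option[String] =
    Option(System.getProperty("file.encoding"))

  // The charset this JVM actually uses when none is specified.
  def defaultCharsetName: String =
    Charset.defaultCharset().name()

  def main(args: Array[String]): Unit = {
    println("file.encoding  = " + fileEncodingProperty.getOrElse("(unset)"))
    println("defaultCharset = " + defaultCharsetName)
  }
}
```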

Adding this line to your .bashrc might fix the problem in some *nix environments:

export JAVA_OPTS='-Dfile.encoding=UTF-8'

Sometimes the IDE is set to the wrong file encoding. You could check this also.

Andreas Neumann
  • Thanks. I'm using Eclipse and the interpreter there says "MacRoman". I guess I was thinking that if I told Eclipse that a file was in UTF-8, it would somehow communicate this to the compiler. – Theodore Norvell Apr 22 '14 at 17:05
  • Hmm… The number of scripts that think they own JAVA_OPTS is excessive. I've never liked that approach… – Randall Schulz Apr 22 '14 at 17:42
  • It has a lot of benefits when working in a polyglot project on the JVM. Otherwise it's quite hard to guarantee interoperability. – Andreas Neumann Apr 22 '14 at 17:52

The Scala plugin respects the encoding settings of Eclipse. You can set the workspace default in Preferences. If that doesn't trickle down to your sources, check if there is an overriding encoding at the project or source folder level.


For example, here is the property page of a source folder:

[Screenshot of the source folder's property page]

Iulian Dragos