
The Logback 1.1.3 LayoutWrappingEncoder documentation doesn't indicate what the default charset will be if the user doesn't set it, but the source code says:

By default this property has the value null which corresponds to the system's default charset.

However I'm using a PatternLayoutEncoder (with a RollingFileAppender), and it seems to be outputting files in UTF-8 (and the default charset of my Windows 7 Professional system is probably not UTF-8).

UTF-8 output is actually what I want, but I want to make sure I'm not getting this by chance, since the documentation seems to indicate something else. So why is Logback giving me UTF-8 output when I haven't explicitly specified a charset?

tripleee
Garret Wilson
  • It looks like you are getting this by chance. I looked in the source code and could not find any classes calling "setCharset" in PatternLayoutEncoder. The documentation indicates with "the charset encoding chosen by the user" what is already described in [this](http://stackoverflow.com/a/13841592/3080094) fine answer. – vanOekel Aug 29 '15 at 21:39
  • But how does this "by chance" work? I'm on a Windows machine --- where is it getting the UTF-8 from? It has to come from somewhere. – Garret Wilson Sep 01 '15 at 15:47
  • The default charset (used via `getBytes()` in `LayoutWrappingEncoder`) is a [bit complicated](http://stackoverflow.com/a/12659462/3080094), but [not a mystery](http://superuser.com/a/879947). The links could help determine where the UTF-8 is coming from? – vanOekel Sep 01 '15 at 16:42
  • Ah, now we're getting somewhere --- you mentioned that LogBack uses the value from `getBytes()`, which means that the value from `Charset.defaultCharset` is used. And oddly enough... this returns UTF-8 on my Windows system! This is surprising, because I had been under the impression that `InputStreamReader` would default to something other than UTF-8 (such as `Windows-1252`) on Windows... but no, that returns `"UTF8"` as well! Maybe my Eclipse+Maven setup is doing something odd, or maybe Java 8 changed the defaults. Anyway, vanOekel, do you want to provide an answer so you can get the bounty? – Garret Wilson Sep 02 '15 at 14:58
  • In Eclipse you can override the file encoding in the workspace settings. I suspect you've already set this to UTF-8. You can also change the encoding in the run profile: you'll be getting UTF-8 in Eclipse because you've set your project/environment to use it. – andygavin Sep 02 '15 at 17:03
  • I've added a section below that explains the situation with eclipse, which I think is a complete answer to your query. – andygavin Sep 02 '15 at 17:10

2 Answers


Logback Character Encoding

You can use <charset> in the definition of your PatternLayoutEncoder, because it is a subclass of LayoutWrappingEncoder, which provides the setCharset method. The documentation hints at this with an excerpt from the class, but gives no example XML configuration. For LayoutWrappingEncoder, an answer has been given here: [Logback-user]: How to use UTF-8.

So if you configure via code, you can call the setCharset method with UTF-8. If you configure via XML, the equivalent is:

<encoder class="ch.qos.logback.classic.encoder.PatternLayoutEncoder">
        <charset>UTF-8</charset>            
        <outputPatternAsHeader>true</outputPatternAsHeader>
        <pattern>[%thread] %-5level %logger{35} - %msg%n</pattern>
</encoder>

Default File Encoding

Logback's documentation is correct in stating that the default character encoding is used. The default character set is typically not UTF-8 on Windows (mine is windows-1252, for instance). The correct thing to do is to configure Logback to use UTF-8 as above. Even if Logback is picking up UTF-8 from somewhere, or file.encoding is somehow being set by you, there's no guarantee that this will still happen in the future.

Incidentally, if you are setting file.encoding on an Oracle VM, note what Sun previously said about this property:

The "file.encoding" property is not required by the J2SE platform specification; it's an internal detail of Sun's implementations and should not be examined or modified by user code. It's also intended to be read-only; it's technically impossible to support the setting of this property to arbitrary values on the command line or at any other time during program execution.
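To see which default your JVM actually picked up, a small check along the following lines may help. This is only a sketch: the class name `DefaultCharsetCheck` and the helper `defaultMatchesUtf8` are made up for illustration, and the output depends entirely on how the JVM was launched (IDE, Maven, command line).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class DefaultCharsetCheck {

    /** Returns true when the JVM default charset encodes the text exactly as UTF-8 does. */
    static boolean defaultMatchesUtf8(String text) {
        byte[] withDefault = text.getBytes();                    // uses Charset.defaultCharset()
        byte[] withUtf8 = text.getBytes(StandardCharsets.UTF_8); // explicit charset
        return Arrays.equals(withDefault, withUtf8);
    }

    public static void main(String[] args) {
        // Read the property for diagnosis only; do not rely on setting it.
        System.out.println("file.encoding  = " + System.getProperty("file.encoding"));
        System.out.println("defaultCharset = " + Charset.defaultCharset());
        // A non-ASCII character exposes the difference between charsets.
        System.out.println("'\u00e9' encoded as UTF-8 by default? " + defaultMatchesUtf8("\u00e9"));
    }
}
```

Running this once from Eclipse and once from a plain command prompt should show whether the IDE is the source of the UTF-8 default.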

Eclipse and Maven

If you are running Maven from Eclipse and you've already set your environment to UTF-8, either for the environment/project or in the Run Configuration (for me, in the Common tab), then Eclipse will arrange for the new JVM to use UTF-8 by setting file.encoding. See: Eclipse's encoding documentation

andygavin

The system's default charset is determined by Java and exposed in the system property file.encoding, but this property can also be specified as the JVM starts up (more in this answer). Eclipse, NetBeans, Maven, etc. can use this system property to set the default charset to UTF-8, which is probably why your output is in UTF-8 even though you did not specify it.

To remove the element of chance, specify the character set for logging as shown in this answer. Logback's source code shows how the character set is used to convert the Strings to bytes to write to file in the convertToBytes method (more on Strings to bytes is explained in this answer).
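The fallback described here can be sketched roughly as follows. This is not Logback's actual source, only an approximation of the behaviour discussed in this thread: a null charset falls through to `String.getBytes()`, which uses the JVM default. The class name `ConvertToBytesSketch` is hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ConvertToBytesSketch {

    // null stands for "no charset configured", as in the documented default.
    private final Charset charset;

    ConvertToBytesSketch(Charset charset) {
        this.charset = charset;
    }

    /** Converts a formatted log line to bytes, falling back to the JVM default charset. */
    byte[] convertToBytes(String s) {
        if (charset == null) {
            return s.getBytes();        // depends on Charset.defaultCharset()
        }
        return s.getBytes(charset);     // deterministic, independent of the platform
    }

    public static void main(String[] args) {
        String text = "\u00e9"; // 'é'
        System.out.println(new ConvertToBytesSketch(StandardCharsets.UTF_8).convertToBytes(text).length);      // 2
        System.out.println(new ConvertToBytesSketch(StandardCharsets.ISO_8859_1).convertToBytes(text).length); // 1
        // With null, the length depends on the default charset of the running JVM.
        System.out.println(new ConvertToBytesSketch(null).convertToBytes(text).length);
    }
}
```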

On Unix, the value for file.encoding is determined using the environment variables (e.g. via LANG=en_US.UTF-8 as explained here, but other environment variables can be involved as well).
On Windows, the default code page is shown with the command chcp. The code page number corresponds with a character set shown in this list. For example, code page 65001 corresponds with UTF-8. The default locale is shown with the command systeminfo | findstr Locale.

In short: once your software leaves your development environment, you cannot assume any specific default character set. Therefore, always specify a character set.
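The same rule applies to any file output, not only to Logback. As a minimal illustration using only the JDK (the class name `ExplicitCharsetWrite` and the temp-file name are made up for this sketch), passing the charset explicitly yields the same bytes on every platform:

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ExplicitCharsetWrite {

    /** Writes the line to a temporary file as UTF-8 and returns the raw bytes written. */
    static byte[] writeAndReadBack(String line) {
        try {
            Path file = Files.createTempFile("charset-demo", ".log");
            try {
                // The charset is passed explicitly, so the JVM default is never consulted.
                try (BufferedWriter writer = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
                    writer.write(line);
                }
                return Files.readAllBytes(file);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = writeAndReadBack("\u00e9");
        // 'é' is always the two bytes C3 A9 in UTF-8.
        System.out.printf("%02X %02X%n", bytes[0], bytes[1]);
    }
}
```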

Community
vanOekel
  • Both provided answers were good. In choosing the bounty I had to take into consideration that andygavin provided an answer first; he provided actual code for solving my problem instead of a link; and was the first to point out that my Eclipse+Maven setup could be the thing making my default charset to be UTF-8. I appreciate your feedback and your notes on `getBytes()` were helpful. – Garret Wilson Sep 03 '15 at 15:13
  • @GarretWilson That sounds fair. Besides, I learned a thing or two along the way and that is always good. – vanOekel Sep 03 '15 at 16:01