52

I am converting a project from Ant to Maven and am having problems with a specific unit test that deals with UTF-8 characters. The problem concerns the following String:

String l_string = "ČäÁÓý\n€řЖжЦ\n№ЯФКЛ";

The unit test fails because the String is read as follows:

?äÁÓý
€????
?????

The Java class is saved as UTF-8, and I also set the build encoding to UTF-8 in the pom.xml.

Here is an excerpt of my pom.xml:

...

<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>

...

<build>
<plugins>
    <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.1</version>
        <configuration>
            <source>1.6</source>
            <target>1.6</target>
            <encoding>${project.build.sourceEncoding}</encoding>
        </configuration>
    </plugin>
    <plugin>
        <artifactId>maven-assembly-plugin</artifactId>
        <version>2.4</version>
        <configuration>
            <descriptorRefs>
                <descriptorRef>jar-with-dependencies</descriptorRef>
            </descriptorRefs>
        </configuration>
    </plugin>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-surefire-plugin</artifactId>
      <version>2.15</version>
    </plugin>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-surefire-report-plugin</artifactId>
      <version>2.15</version>
    </plugin>
 </plugins>
</build>

Am I missing something here? It would be great if someone could help me.

Update

Regarding the test code:

@Test
public void testTransformation()
{

    String l_string = "ČäÁÓý\n€řЖжЦ\n№ЯФКЛ";
    System.out.println( ">>> " + l_string );
     c_log.info( l_string );
    StringBuffer l_stringBuffer = new StringBuffer();
    int l_stringLength = l_string.length();

    String l_fileName = System.getProperty( "user.dir" ) + File.separator + "transformation" + File.separator + "TransformationMap.properties";
    Transformation.init( l_fileName );

    Properties l_props = Transformation.getProps();
    for ( int i = 0; i < l_stringLength; i++ )
    {
        char l_char = l_string.charAt( i );
        int l_intValue = (int) l_char;
        if ( l_intValue <= 255 )
        {
            l_stringBuffer.append( l_char );
        }
        else
        {
            l_stringBuffer.append( l_props.getProperty( String.valueOf( l_char ), "" ) );
        }
    }
    c_log.info( l_stringBuffer.toString() );
    byte[] l_bytes = l_string.getBytes();
    byte[] l_transformedBytes = Transformation.transform( l_bytes );
    assertNotNull( l_transformedBytes );

}

The logic after the first System.out.println probably isn't relevant, because that call already prints the aforementioned "?" instead of the correct characters (and the subsequent tests therefore fail). There is also no use of a default platform encoding.

The test converts each character according to the TransformationMap.properties file, which is in the following form (just an excerpt):

Ý=Y
ý=y
Ž=Z
ž=z
°=.
€=EUR

It should be noted that the test runs without any problem when I build the project with Ant.

softandsafe
  • 1
    What is the test code? Does it use the platform default encoding at any place? Or does the code under test do that somewhere? – Joachim Sauer Jul 15 '13 at 14:27
  • possible duplicate of [Is there a way to make maven build class files with UTF-8 without using the external JAVA\_TOOL\_OPTIONS?](http://stackoverflow.com/questions/10368527/is-there-a-way-to-make-maven-build-class-files-with-utf-8-without-using-the-exte) – Danack Jul 15 '13 at 15:04
  • @Joachim Sauer: I updated my posting. – softandsafe Jul 15 '13 at 15:04
  • @softandsafe: that's not a useful test, because if your output console isn't set to use a unicode encoding, then the output will look wrong, even if `l_string` contains the correct data (i.e. even if it is compiled correctly). Do you have an actual **assert** that fails? Or do you just verify visually if it works? – Joachim Sauer Jul 15 '13 at 15:06
  • @JoachimSauer: I updated my post again. I have an actual assert that fails. – softandsafe Jul 15 '13 at 15:15
  • @Danack: Thank you, but the solutions in the possible duplicate do not change the behavior. – softandsafe Jul 15 '13 at 15:20
  • From the possible duplicate "It is not enough to define that property. You MUST pass it inside the appropriate plugins. It won't go by magic inside there." But you aren't passing the property into the compiler plugin. – Danack Jul 15 '13 at 15:39
  • Looks very much like the java source is in Windows Latin-1 (Cp1252). Test with JEdit or so, try `\u....` as in the answer below. – Joop Eggen Jul 15 '13 at 16:05
  • @Danack: If by "pass it inside the appropriate plugins" you mean including the "${project.build.sourceEncoding}" tags in the maven compiler plugin and the maven resources plugin: I did that. Oddly enough, if I run "export JAVA_TOOL_OPTIONS=-Dfile.encoding=UTF-8" (or set instead of export on Windows) before doing mvn clean install, the test runs without any error. – softandsafe Jul 15 '13 at 16:07
  • How about updating the maven config in your question to show the option being set for the plugins? – Danack Jul 15 '13 at 16:29
  • @Danack: Sorry, updated my post now. – softandsafe Jul 16 '13 at 08:05

5 Answers

147

I have found a "solution" myself:

I had to pass the encoding into the maven-surefire-plugin, but the usual

<encoding>${project.build.sourceEncoding}</encoding>

did not work. I still have no idea why, but when I pass the encoding to the plugin as a command-line argument, the tests work as they should:

<plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-surefire-plugin</artifactId>
      <version>2.15</version>
      <configuration>
        <argLine>-Dfile.encoding=UTF-8</argLine>
      </configuration>
</plugin>
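
One possible reason the `argLine` matters, offered as an assumption since the thread does not confirm it: the test calls `l_string.getBytes()` without a charset, and that overload encodes with the JVM default charset of the forked test process. The sketch below uses a hypothetical `roundTrip` helper and escaped literals (so it does not depend on the source encoding) to show how a charset that cannot represent the characters turns them into `?`.

```java
import java.nio.charset.Charset;

public class DefaultCharsetDemo {

    // Encode and decode with the same charset; characters the charset
    // cannot represent come back as '?'.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // C with caron, euro sign, Cyrillic Zhe, written as escapes.
        String text = "\u010C\u20AC\u0416";

        // Whatever -Dfile.encoding (or the OS) selected for this JVM.
        Charset platform = Charset.defaultCharset();
        System.out.println("default charset: " + platform);

        // text.getBytes() with no argument encodes like this call does.
        System.out.println("survives default charset: "
                + roundTrip(text, platform).equals(text));

        // UTF-8 can represent all three characters.
        System.out.println(roundTrip(text, Charset.forName("UTF-8")).equals(text)); // true

        // ISO-8859-1 can represent none of them.
        System.out.println(roundTrip(text, Charset.forName("ISO-8859-1"))); // ???
    }
}
```

Passing an explicit charset to `getBytes` in the code under test would remove the dependency on `file.encoding` altogether, if changing that code is an option.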

Thanks for all your responses and additional comments!

softandsafe
  • That's really odd. The surefire plugin shouldn't care about this at all. Did you use `mvn help:effective-pom` on the broken module to see which options are passed to surefire? – Aaron Digulla Jul 16 '13 at 08:35
  • I did. ` maven-surefire-plugin 2.15 default-test test test -Dfile.encoding=UTF-8 -Dfile.encoding=UTF-8 ` – softandsafe Jul 16 '13 at 09:28
  • :-/ Why are there two `` elements? Is that on Windows, Mac or Linux? – Aaron Digulla Jul 16 '13 at 10:33
  • To be honest, I have no idea. It is on Windows, but I will try to build the project later on Linux. – softandsafe Jul 16 '13 at 11:13
  • 1
    I'm wondering what the default encoding is; probably `cp15xx`. Try this: Remove the `-Dfile.encoding` and print the result of `Charset.defaultCharset()` in your test. I'm also wondering why it matters; the code is compiled with the compiler plugin; surefire should be independent of the compile step. – Aaron Digulla Jul 16 '13 at 11:36
  • 2
    `windows-1252`. It seems to use the OS default encoding, but the encoding is set everywhere in the pom file to UTF-8 even in the surefire-plugin. – softandsafe Jul 16 '13 at 13:18
  • Ah, now it makes sense. Set `forkMode` to `once`. That should fix it. – Aaron Digulla Jul 16 '13 at 16:16
  • Hm :-/ I just noticed that "once" is the default. Still, the default encoding should not have any impact on compiled Java code. There must be something else that we're missing. – Aaron Digulla Jul 16 '13 at 16:21
  • I also have test cases with unicode, and when I run maven with Java 1.6 I need to set this argument. While with java 1.7 it's not necessary to set. –  Oct 10 '14 at 08:38
  • 17
    Maybe a bit more resilient solution would be `-Dfile.encoding=${project.build.sourceEncoding}` – Rade_303 May 15 '15 at 16:29
  • 5
    This is still open. Issue moved from codehaus to apache at https://issues.apache.org/jira/browse/SUREFIRE-951 – BlueDog Mar 17 '17 at 10:00
  • Saved my day, man! It took me like 2 hours to figure out the root of my problem (tests were running perfectly fine from IntelliJ IDEA, but one test always failed from Maven). The reason was that a literal string in Cyrillic was passed properly (as UTF-8) into the myBatis mapper when run from IDEA, but wrongly (as Cp1251) when run with Maven. Adding the Surefire plugin configuration to the POM file and specifying argLine did the trick! – 62mkv Oct 11 '17 at 13:28
  • The [issue](https://issues.apache.org/jira/browse/SUREFIRE-951) is marked as Solved but as the last comment states, it's still occurring. I have it in version `3.0.0-M4`. This fixed it. – Nico Van Belle Aug 19 '20 at 13:50
  • while this does solve the issue for encoding, I'm not getting any code coverage shown anymore when I overwrite the configuration -> argLine – EasterBunnyBugSmasher Aug 25 '21 at 17:40
  • 1
    As just commented: this change breaks other functionality. So you must write `@{argLine} -Dfile.encoding=UTF-8` – EasterBunnyBugSmasher Aug 25 '21 at 18:10
  • awesome.. you did an awesome thing.. which is not there anywhere.. Thanks a lot.. saved a lot of time – Gaurav Khurana Feb 17 '23 at 13:52
10
  1. When debugging Unicode problems, make sure you convert everything to ASCII so you can read and understand what is inside of a String without guesswork. This means you should use, for example, StringEscapeUtils from commons-lang3 to turn ä into \u00e4. That way, you can be sure that you see ? because the console can't print it. And you can distinguish " " (\u0020) from " " (\u00a0)

    In the test case, check the escaped version of the inputs as early as possible to make sure the data is actually what you expect.

    So the code above should be:

    assertEquals("\u010c\u00e4\u....", escape(l_string));
    
  2. Make sure you use the correct encoding for file I/O. Never use the default encoding of Java, always use InputStreamReader/OutputStreamWriter and specify the encoding to use.

  3. The POM looks correct. Run mvn with -X to make sure it picks up the correct options and runs the Java compiler using the correct options. mvn help:effective-pom might also help.

  4. Disassemble the class file to check the strings. Java will use ? to denote that it couldn't read something.

    If you get the ? from System.out.println( ">>> " + l_string );, this means the code wasn't compiled with UTF-8 or that the source file was maybe saved with another Unicode encoding (UTF-16 or similar).

    Another source of problems could be the properties file. Make sure it was saved with ISO-8859-1 and that it wasn't modified by the compilation process.

  5. Make sure Maven actually compiles your file. Use mvn clean to force a full-recompile.
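
Steps 1 and 2 above can be sketched in plain Java. This is only an illustration: `escape` is a hypothetical stand-in for the undefined helper in step 1 (the answer suggests StringEscapeUtils from commons-lang3 instead), and `loadUtf8` assumes the properties data is UTF-8 rather than the ISO-8859-1 that `Properties.load(InputStream)` expects.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.util.Properties;

public class EncodingChecks {

    // Step 1: keep printable ASCII, turn every other char into a
    // backslash-u escape with four lowercase hex digits.
    static String escape(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c >= 0x20 && c < 0x7f) {
                out.append(c);
            } else {
                out.append(String.format("\\u%04x", (int) c));
            }
        }
        return out.toString();
    }

    // Step 2: decode with an explicit charset instead of the JVM default.
    static Properties loadUtf8(byte[] data) {
        Properties props = new Properties();
        try {
            Reader reader = new InputStreamReader(
                    new ByteArrayInputStream(data), Charset.forName("UTF-8"));
            try {
                props.load(reader);
            } finally {
                reader.close();
            }
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // The escaped form is pure ASCII, so any console can display it.
        System.out.println(escape("\u010C\u00E4"));

        // An in-memory stand-in for one line of TransformationMap.properties.
        byte[] data = "\u20AC=EUR\n".getBytes(Charset.forName("UTF-8"));
        System.out.println(loadUtf8(data).getProperty("\u20AC")); // EUR
    }
}
```

Asserting on the escaped string keeps the test independent of how the console renders non-ASCII characters.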

Aaron Digulla
5

I had a really persistent problem of this kind, and setting the environment variable

MAVEN_OPTS=-Dfile.encoding=UTF-8

fixed the issue for me.

David Vonka
4

Your problem is not the encoding of the source file (and therefore of the String inside your class file); the problem is the encoding of System.out's implicit PrintStream. It uses file.encoding, which represents the system encoding, and on Windows this is the ANSI code page.

You would have to set up a PrintWriter with the OEM code page (or use the class intended for this: Console).
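
A minimal sketch of binding a stream to an explicit encoding, using `PrintStream` rather than the `PrintWriter` mentioned above. The `printWith` helper is hypothetical, and UTF-8 is only an assumption here: the right value for a real Windows console is its OEM code page, which this example does not detect.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class ExplicitPrintStream {

    // Print text through a PrintStream bound to the named encoding and
    // return the raw bytes that reached the underlying stream.
    static byte[] printWith(String text, String encoding) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try {
            PrintStream out = new PrintStream(sink, true, encoding);
            out.print(text);
            out.flush();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String euro = "\u20AC";

        // UTF-8 encodes the euro sign as three bytes.
        System.out.println(printWith(euro, "UTF-8").length); // 3

        // ISO-8859-1 has no euro sign, so the stream writes '?' (byte 63).
        System.out.println(printWith(euro, "ISO-8859-1")[0]); // 63

        // The same constructor can wrap System.out for console output.
        PrintStream console = new PrintStream(System.out, true, "UTF-8");
        console.println(euro);
    }
}
```

Capturing the bytes this way separates two questions: whether the String holds the right characters, and whether the output encoding can represent them.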

See also various bugs around this in: http://bugs.java.com/bugdatabase/view_bug.do?bug_id=4153167

eckes
4

This works for me:

...
 <properties>
        <project.build.sourceEncoding>ISO-8859-1</project.build.sourceEncoding>
        <project.reporting.outputEncoding>ISO-8859-1</project.reporting.outputEncoding>
    </properties>
...
  <build>
    <finalName>Project</finalName>

    <sourceDirectory>src</sourceDirectory>
    <plugins>
      <plugin>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>2.3.2</version>
        <configuration>
          <source>1.6</source>
          <target>1.6</target>
          <encoding>${project.build.sourceEncoding}</encoding>
        </configuration>
      </plugin>
      <plugin>
        <artifactId>maven-war-plugin</artifactId>
        <version>2.2</version>
        <configuration>
          <warSourceDirectory>WebContent</warSourceDirectory>
        </configuration>
      </plugin>
    </plugins>
  </build>
Eric Martinez