I have a Java file in Eclipse that is saved as UTF-8 and has some strings containing accents.

In the Java file itself, the accent is written and saved as é. In the XML that is generated using Velocity, the é becomes é. In the PDF that is generated using FOP and an XSL template, the output is displayed as é.

So this is probably an encoding issue, and everything should be in UTF-8. What's weird is that locally, in my Eclipse environment (Windows) where I run the application, the whole process works and the correct accents (é) are displayed in the PDF.

However, when the application is built with Maven and deployed to a Unix environment, I see the problem described above.

SyntaxT3rr0r
Ayrad
  • Not sure about the Maven/Velocity/PDF-generator parts, but this sounds very much as if the transfer to Unix didn't treat text-based files as UTF-8. Open the files in a UTF-8 capable editor on Unix and take a look, to rule out one or the other. – BalusC Dec 01 '10 at 15:33
  • A minor point but java files are usually either UCS-2 or UTF-16 encoded. – GaryF Dec 01 '10 at 15:45
  • @GaryF: I think you're a bit confused between *.java* file encoding and the JVM's internal string representation. A *.java* file is just a text file, with no metadata and its encoding depends solely on the editor you use to create the file. For example if I type an 'é' in IntelliJ IDEA (my Java IDE of choice) and save the file, it is going to be saved, by default, as a UTF-8 file. In addition to that, I've got a really hard time remembering the last time I saw a *.java* file encoded as UCS-2. – SyntaxT3rr0r Dec 01 '10 at 15:55
  • In eclipse you can right click on a java file and choose any text file encoding including UTF-8. – Ayrad Dec 01 '10 at 15:57
  • @Ayrad: I've written here that strings containing non-ASCII characters should be externalized to files and not put directly in the *.java* file, **OR** you'll have a lot of issues, including but not limited to problematic batch/scripting and encoding issues when transferring the file to/from various OSes, IDEs, text editors, etc. That said, some people have trouble accepting this **fact**. Out of curiosity, what happens if you use the *\u00E9* escape in your .java source file? – SyntaxT3rr0r Dec 01 '10 at 15:59
  • Good point about externalizing the strings I should probably do that. I will also try with the \u00E9 and report back but it feels like a workaround. – Ayrad Dec 01 '10 at 16:06
  • @Ayrad The troublesome **é** in your source code appears in a String or char literal (that is, in quotes, like `"gloph dréusse"` or `'é'`), right? And the Java code is passing it to Velocity without sending it to a file first, or over the network, or anything like that? – Jason Orendorff Dec 01 '10 at 16:12
  • @Webinator - Fair point. I had always (erroneously) assumed that files matched the string encoding. – GaryF Dec 01 '10 at 16:45

1 Answer

Perhaps Eclipse is compiling the file with a different javac command line than Maven.

When you compile Java, you have to tell the compiler the encoding of the source files (if they contain non-ASCII characters and the default doesn't work).

javac -encoding utf8 MyCode.java
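To see why the compiler's idea of the source encoding matters, here is a minimal sketch that decodes the UTF-8 bytes of é with two different charsets. The class and method names (`EncodingMismatchDemo`, `misread`) are made up for illustration, and ISO-8859-1 is only an assumed stand-in for whatever default the Unix build actually picks up.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingMismatchDemo {

    // Encodes the text as UTF-8 (as the editor saved it) and decodes it with
    // the charset the reader assumes. Hypothetical helper, for illustration only.
    static String misread(String text, Charset assumed) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, assumed);
    }

    public static void main(String[] args) {
        String original = "\u00e9"; // e with acute accent, written as an escape

        String correct = misread(original, StandardCharsets.UTF_8);
        String garbled = misread(original, StandardCharsets.ISO_8859_1);

        // Decoding with the right charset round-trips to one character.
        System.out.println(correct.equals(original)); // true
        System.out.println(correct.length());         // 1

        // Decoding with the wrong charset turns the two UTF-8 bytes
        // (0xC3 0xA9) into two separate characters.
        System.out.println(garbled.length());                 // 2
        System.out.println(garbled.equals("\u00c3\u00a9"));   // true

        // The platform default varies with OS, locale, and JDK version.
        System.out.println(Charset.defaultCharset());
    }
}
```

If the last line prints something other than UTF-8 on the build machine, that would be consistent with the theory that the two builds read the same source file differently.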

I think the way to fix this in Maven is to add this to your pom.xml file:

<project>
  ...
  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>
  ...
</project>

(I got that from a Maven FAQ about a slightly different issue.)

You could instead avoid the encoding issue entirely by using ugly Unicode escape sequences in your Java file. é would become \u00e9. Worse for humans, easier for the toasters. (As Perlis said, "In man-machine symbiosis, it is man who must adjust: The machines can't.")
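If you go the escape route, you don't have to look up each code point by hand. The following is a rough sketch of a helper that rewrites non-ASCII characters as Unicode escapes; `UnicodeEscaper` and `escapeNonAscii` are hypothetical names, not part of any library.

```java
public class UnicodeEscaper {

    // Returns the text with every char above the ASCII range replaced by a
    // backslash, the letter u, and four lowercase hex digits.
    // Characters outside the BMP come out as two escapes (a surrogate pair),
    // which is still valid in Java source.
    static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 128) {
                out.append(c);
            } else {
                out.append(String.format("\\u%04x", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The accented letter is itself written as an escape here so that
        // this file stays pure ASCII.
        String literal = "gloph dr\u00e9usse";

        // Prints the same text, but with the accent spelled out as an
        // escape sequence that can be pasted into a string literal.
        System.out.println(escapeNonAscii(literal));
    }
}
```

The output is plain ASCII, so it survives any source encoding the compiler might assume.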

Jason Orendorff
  • This doesn't explain why it works locally in Windows. As far as I understand, build/compile happens on Windows and then the files are transferred to Unix. Using the Unicode escapes is however a good suggestion as a workaround/prevention. – BalusC Dec 01 '10 at 15:54
  • @BalusC My theory is that the Eclipse build is different from the Maven build. – Jason Orendorff Dec 01 '10 at 15:57
  • +1... Exactly. Unicode escape sequences and string externalization if needed. – SyntaxT3rr0r Dec 01 '10 at 16:00
  • Ah, I thought that when you're using Maven in Eclipse, that it will then build by Maven. But I shouldn't think about Maven since I don't use it :) You may indeed be right about that point. – BalusC Dec 01 '10 at 16:01
  • @BalusC: I know, I know... Which is precisely why I'm right every single time I write that developers putting non-escaped non-ASCII characters in *.java* source files should be shot to death: there are a **LOT** of very weird issues that can crop up. Here we're just talking about Eclipse/Maven. Add IntelliJ, OS X, Un*xes and batch/scripts to the mix and you'll see why here I made it mandatory to use a custom Ant task that **forces** *.java* files to be ASCII only (the build fails if you fail to comply with this) – SyntaxT3rr0r Dec 01 '10 at 16:04
  • Well, the firing squad seems a bit extreme. But even the legendary Jon Skeet says it's easier to use `\u` escapes: http://stackoverflow.com/questions/464874/unmappable-character-for-encoding-warning-in-java/464886#464886 – Jason Orendorff Dec 01 '10 at 16:16
  • (I blame the designers of the Java language for not specifying what encoding a file uses, and the javac designers for using the platform-specific default encoding. In hindsight, it's obvious that the meaning of a Java source file should have been platform-independent. Oops.) – Jason Orendorff Dec 01 '10 at 16:18
  • @Jason Orendorff: +1 to your comment about responsibilities. However now we're "stuck" with this sad state of affairs and escaping (or externalization) is what we're left with :-/ – SyntaxT3rr0r Dec 02 '10 at 14:36