Why does my text normalization behave differently in different environments?

Question

I am normalizing some accented text using the following code, taken from this answer:

Accent removal:

String accented = "árvíztűrő tükörfúrógép";
String normalized = Normalizer.normalize(accented, Normalizer.Form.NFD);
normalized = normalized.replaceAll("[^\\p{ASCII}]", "");
System.out.println(normalized);

When I run this from within IntelliJ (as part of a unit test), it gives the expected result:

arvizturo tukorfurogep

If I run this from the command line (via Gradle), I get:

ArvAztArA tAkArfArAgA

In both cases, I'm using the same PC and Java 1.8.0_151.

The relevant parts from build.gradle:

apply plugin: 'java'
apply plugin: 'idea'
sourceCompatibility = 1.8
targetCompatibility = 1.8
dependencies {
  testCompile group: 'junit', name: 'junit', version: '4.12'
}

What causes this different behaviour? And how do I ensure I get the expected result everywhere?

Could you please share your Gradle file? I tried the same code using Gradle and it worked. — arjunsv3691, Nov 10 '17 at 00:59
Where's the run task in your Gradle file? You're running the code using a Gradle task, right? Like this: task(runui, dependsOn: 'classes', type: JavaExec) { main = 'stockticker.ui.StockTickerDriver' classpath = sourceSets.main.runtimeClasspath } — arjunsv3691, Nov 10 '17 at 01:13
I agree, this looks more like a compile-time problem than a runtime problem. Define your source encoding (or use Unicode escapes and ASCII only in the source code). — eckes, Nov 13 '17 at 00:26
I'll try the character encoding at compile time. The actual text is user-supplied, so I will not be able to escape it at compile time. — dave, Nov 13 '17 at 00:28
OK, looks like we're onto something. In IntelliJ `System.getProperty("file.encoding");` returns `UTF-8`, while on the command line, I get `windows-1252`. Now to figure out an actual solution. — dave, Nov 13 '17 at 00:54
Your sample code shows a literal, which is subject to the source encoding. When you read a file, don't use the default encoding of FileReader (and show the complete code). — eckes, Nov 13 '17 at 05:50
I don't actually read a file at any point in the real process. There's a string literal in the test code. In production, it's a string received from a browser request (hence the need to sanitise it). — dave, Nov 13 '17 at 06:29

score 1 · Accepted Answer · answered Nov 13 '17 at 01:10

Thanks to @eckes and others for the compile time suggestion. By specifying an encoding at compile time, I was able to get the desired result.

The setting I added to build.gradle was:

compileTestJava.options.encoding = 'UTF-8'

This option only affects the test classes (which is where my issue was). You can also use:

compileJava.options.encoding = 'UTF-8'

if you have text in your production code that needs to be encoded.

An alternative solution I came across, which sets the encoding for every Java compile task (main and test alike), is:

tasks.withType(JavaCompile) {
  options.encoding = 'UTF-8'
}

(Interestingly, none of the above solutions changed the value of the file.encoding system property.)
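To illustrate why the compile-time encoding matters, here is a minimal sketch of what probably happened. It assumes the source file was saved as UTF-8 but read by the compiler as windows-1252 (the default reported on the command line). The class name `EncodingMismatchDemo` and the helper `misdecode` are made up for this illustration; `misdecode` only simulates the compiler's misreading by re-decoding the UTF-8 bytes of the literal. The literal is written with Unicode escapes so the demo does not depend on the source encoding itself.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

public class EncodingMismatchDemo {

    // Same accent removal as in the question: decompose, then drop non-ASCII.
    public static String stripAccents(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("[^\\p{ASCII}]", "");
    }

    // Hypothetical helper: simulates a compiler reading UTF-8 source bytes
    // as windows-1252, which turns each accented letter into two characters.
    public static String misdecode(String s) {
        byte[] utf8Bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        // The test string from the question, spelled with Unicode escapes.
        String accented =
            "\u00e1rv\u00edzt\u0171r\u0151 t\u00fck\u00f6rf\u00far\u00f3g\u00e9p";

        // Literal read correctly: prints arvizturo tukorfurogep
        System.out.println(stripAccents(accented));

        // Literal misread: each accented letter has become an accented
        // capital A plus a symbol, so only a plain 'A' survives the cleanup.
        System.out.println(stripAccents(misdecode(accented)));
    }
}
```

Under that assumption, the second line shows the same kind of damage as the command-line output: the accented letters were already corrupted in the compiled string literal before `Normalizer` ever saw them. That would also explain why `file.encoding` is unchanged: the Gradle option only tells the compiler how to read source files, not which default charset the JVM uses at runtime.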
