
I am normalizing some accented text using the following approach, taken from this answer.

Accent removal:

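import java.text.Normalizer;

String accented = "árvíztűrő tükörfúrógép";
// Decompose each accented letter into base letter + combining mark (NFD)
String normalized = Normalizer.normalize(accented, Normalizer.Form.NFD);
// Drop everything that is not ASCII, i.e. the combining marks
normalized = normalized.replaceAll("[^\\p{ASCII}]", "");
System.out.println(normalized);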

When I run this from within IntelliJ (as part of a unit test), it gives the expected result:

arvizturo tukorfurogep

If I run this from the command line (via gradle), I get:

ArvAztArA tAkArfArAgA

In both cases, I'm using the same PC and Java 1.8.0_151.

The relevant parts from build.gradle:

apply plugin: 'java'
apply plugin: 'idea'
sourceCompatibility = 1.8
targetCompatibility = 1.8
dependencies {
  testCompile group: 'junit', name: 'junit', version: '4.12'
}

What causes this different behaviour? And how do I ensure I get the expected result everywhere?

dave
  • Could you please share your Gradle file? I tried the same code with Gradle and it worked. – arjunsv3691 Nov 10 '17 at 00:59
  • Question updated with `gradle` snippet. – dave Nov 10 '17 at 01:05
  • Where's the run task in your Gradle file? You're running the code using a Gradle task, right? Something like this: `task(runui, dependsOn: 'classes', type: JavaExec) { main = 'stockticker.ui.StockTickerDriver' classpath = sourceSets.main.runtimeClasspath }` – arjunsv3691 Nov 10 '17 at 01:13
  • On the command line, I type `gradle clean test` – dave Nov 10 '17 at 02:48
  • I agree, this looks more like a compile-time problem than a runtime problem. Define your source encoding (or use Unicode escapes and ASCII only in the source code). – eckes Nov 13 '17 at 00:26
  • I'll try the character encoding at compile time. The actual text is user-supplied, so I will not be able to escape it at compile time. – dave Nov 13 '17 at 00:28
  • OK, looks like we're onto something. In IntelliJ `System.getProperty("file.encoding");` returns `UTF-8`, while on the command line, I get `windows-1252`. Now to figure out an actual solution. – dave Nov 13 '17 at 00:54
  • Your sample code shows a literal, which is subject to the source encoding. When you read a file, don't use the default encoding of FileReader (and show the complete code). – eckes Nov 13 '17 at 05:50
  • I don't actually read a file at any point in the real process. There's a string literal in the test code. In production, it's a string received from a browser request (hence the need to sanitise it). – dave Nov 13 '17 at 06:29

1 Answer


Thanks to @eckes and others for the compile time suggestion. By specifying an encoding at compile time, I was able to get the desired result.
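To see why the compile-time encoding matters, here is a minimal sketch of what appears to have gone wrong. It assumes the source file was saved as UTF-8 and that the compiler read it with the platform default, windows-1252 (the value reported for the command line in the comments). The class and method names are made up for illustration, and the literal is written with Unicode escapes so the demo itself does not depend on the source encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

public class EncodingMismatchDemo {

    // Same accent-removal approach as in the question.
    static String stripAccents(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("[^\\p{ASCII}]", "");
    }

    // Simulates a compiler reading UTF-8 source bytes as windows-1252:
    // each two-byte accented letter turns into two unrelated characters.
    static String misdecode(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        // The test string from the question, spelled with Unicode escapes.
        String accented =
                "\u00e1rv\u00edzt\u0171r\u0151 t\u00fck\u00f6rf\u00far\u00f3g\u00e9p";

        // Literal decoded correctly: accents are stripped as intended.
        System.out.println(stripAccents(accented));

        // Literal decoded with the wrong charset: the first byte of each
        // accented letter becomes an accented capital A, which is then
        // reduced to a plain A; the second byte becomes a non-ASCII symbol
        // that the regex removes.
        System.out.println(stripAccents(misdecode(accented)));
    }
}
```

The first line printed is the expected `arvizturo tukorfurogep`; in the second, every accented letter has turned into `A`, the same pattern as the output from the command-line run.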

The setting I added to build.gradle was:

compileTestJava.options.encoding = 'UTF-8'

This option only affects the test classes (which is where my issue was). You can also use:

compileJava.options.encoding = 'UTF-8'

if your production sources also contain non-ASCII text.

An alternative solution I came across is:

tasks.withType(JavaCompile) {
  options.encoding = 'UTF-8'
}

(Interestingly, none of the above solutions changed the value of the file.encoding system property.)
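That is consistent with how the option works: `options.encoding` is handed to the compiler as its `-encoding` flag, which only controls how source files are read. It does not touch the default charset of the JVM that later runs the tests. A small sketch for checking what the running JVM actually uses (the class name is hypothetical):

```java
import java.nio.charset.Charset;

public class DefaultCharsetReport {

    // Reports the runtime default charset, which is independent of the
    // encoding the compiler used to read the source files.
    static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running this from the IDE and from the command line should show whether the two environments still differ at runtime. That matters for code that converts bytes to text without naming a charset, but not for string literals once the compiler encoding is set.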

dave