I'm facing a problem with umlauts in Groovy/Java on an Ubuntu server.

This Groovy code returns false from exists() for files with umlauts in their names:

def f1 = new File('/var/lib/jenkins/test/')
def files = [:]
f1.listFiles().each {
  files.put(it.name, it.getAbsoluteFile().exists())
}
println files
println 'file.encoding:' + System.getProperty('file.encoding')

Results in:

Verderblichkeit.docx:true
Gefa��hrlichkeit.docx:false
file.encoding:"iso-8859-1"

So it returns false for a file it found itself with listFiles(). That is wrong.

ls -al in the directory:

drwxr-xr-x  2 jenkins jenkins   4096 Jan  5 18:17 .
drwxr-xr-x 66 jenkins jenkins  12288 Jan  5 18:16 ..
-rw-r--r--  1 jenkins jenkins  98035 Jan  5 18:16 Gefährlichkeit.docx
-rw-r--r--  1 jenkins jenkins 277515 Jan  5 18:17 Verderblichkeit.docx

In Linux I can copy, mv or rename the files, and the umlauts are displayed correctly.

Environment:

  • Version of Java: Java(TM) SE Runtime Environment (build 1.8.0_131-b11)

Note: The original problem is that the app gets the file path from a database. The file can be found and served through nginx, but in the Java app (Grails with Groovy files) File.exists() returns false.

What can I do?

I tried setting file.encoding to UTF-8, both in the application environment and via a -D parameter on startup. I searched the web but didn't find a solution.

Dirk27
  • Interestingly, if you copied and pasted from your terminal, Gefährlichkeit.docx has its a umlaut as two separate characters, which can happen sometimes. What, in the terminal, is the output of `ls *rlichkeit.docx`? – g00se Jan 05 '23 at 18:08
  • What if you print name with and without getAbsoluteFile()? 2 bytes means utf-8 but if you still see 2 bytes in terminal - it's wrong. – daggett Jan 05 '23 at 20:56
  • why do you need to check for existence of files, which are returned by file iterating methods? if files are returned, they DO exist by definition... – injecteer Jan 05 '23 at 23:14
  • Thanks for your replies. We use UTF-8 in our dev env and in the database. The app runs in a tomcat directly on a server and in a container. – Dirk27 Jan 06 '23 at 08:11
  • @g00se: With ls the file name is printed correctly, as in the ls -al output above: 'Gefährlichkeit.docx'. – Dirk27 Jan 06 '23 at 08:35
  • @daggett: The file name inside the app is always printed wrong like above as 'Gefa��hrlichkeit.docx'. – Dirk27 Jan 06 '23 at 08:35
  • @injecteer: This is only code showing the core of the problem. The app gets the file name from the database as 'Gefährlichkeit.docx' and java can't find the file then. – Dirk27 Jan 06 '23 at 08:36
  • if the file name comes from the DB, it should be a URI instead! – injecteer Jan 06 '23 at 10:22
  • Your problem is being caused by Unicode composition. See [java.text.Normalizer](https://docs.oracle.com/en/java/javase/18/docs/api/java.base/java/text/Normalizer.html) Perhaps more from me on this later – g00se Jan 06 '23 at 10:45
  • Thanks for your replies. So far I think I can fix the problem by setting sun.jnu.encoding to 'UTF-8' on application startup. In the dev env it works when set in the grails 4 app via the bootRun section of build.gradle. I'll try it on the tomcat servers now and report back later. – Dirk27 Jan 06 '23 at 11:38

2 Answers


Solution

The problem occurred in different environments:

  1. development env: grails 4 application started with gradle bootRun
  2. CI-stage with a tomcat 9 server
  3. production env: tomcat running in a docker container

Short answer: The problem was a wrong setting for sun.jnu.encoding. The solution was to set it correctly for each env.
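To see which values the JVM actually picked up in a given env, a small check along these lines may help. It is only a sketch: the class name EncodingCheck is made up for illustration, and sun.jnu.encoding is an implementation-specific property, so it may be absent or behave differently on other JVMs.

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;

public class EncodingCheck {
    // Collects the encoding-related settings the running JVM actually sees.
    // Values may be null if a property or environment variable is not set.
    static Map<String, String> encodingSettings() {
        Map<String, String> settings = new LinkedHashMap<>();
        settings.put("sun.jnu.encoding", System.getProperty("sun.jnu.encoding"));
        settings.put("file.encoding", System.getProperty("file.encoding"));
        settings.put("defaultCharset", Charset.defaultCharset().name());
        settings.put("LC_ALL", System.getenv("LC_ALL"));
        settings.put("LANG", System.getenv("LANG"));
        return settings;
    }

    public static void main(String[] args) {
        encodingSettings().forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

Running it with the same startup options as the application (bootRun, Tomcat, container) shows whether the -D parameters and the locale really reached the JVM.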

Long answer: We had to set the Java system property 'sun.jnu.encoding' in each of the envs:

1. dev env

Set system properties in the bootRun section in build.gradle:

bootRun {
    jvmArgs(
        '-Dsun.jnu.encoding=UTF-8',
        '-Dfile.encoding=UTF-8',
        ...)
}

2. tomcat 9 on server

Set system properties in setenv.sh in tomcat/bin:

export JAVA_OPTS="-Dsun.jnu.encoding=UTF-8 -Dfile.encoding=UTF-8 $JAVA_OPTS"

3. tomcat 9 in docker container in prod env

We used this solution: https://stackoverflow.com/a/28406007/14748724. We needed to rebuild the container image.

Finally we had to set this in the docker-compose.yaml file:

tomcat:
   environment:
      LC_ALL: 'en_US.UTF-8'

Before, it was LC_ALL: 'C', which was wrong.

Note: Using the setenv.sh solution from env 2. didn't work in the container!
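A possible explanation of why LC_ALL: 'C' breaks the lookup can be sketched in a few lines. It rests on two assumptions that are not confirmed here: that the JVM decodes file names as ASCII under the C locale, and that the name on disk stores the umlaut in decomposed form (an 'a' followed by the combining diaeresis U+0308), as suggested in the comments on the question. The class name LocaleDecodeDemo is made up for illustration.

```java
import java.nio.charset.StandardCharsets;

public class LocaleDecodeDemo {
    // UTF-8 bytes of 'a' followed by U+0308 (combining diaeresis): a decomposed a-umlaut.
    static final byte[] DECOMPOSED_A_UMLAUT = {0x61, (byte) 0xCC, (byte) 0x88};

    // What the JVM would see if it decoded the name bytes as US-ASCII (assumed for LC_ALL=C).
    static String decodeAsAscii(byte[] raw) {
        return new String(raw, StandardCharsets.US_ASCII);
    }

    // What the JVM would see if it decoded the same bytes as UTF-8.
    static String decodeAsUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Renders each code point as U+XXXX so invisible differences become visible.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println("ASCII: " + codePoints(decodeAsAscii(DECOMPOSED_A_UMLAUT)));
        System.out.println("UTF-8: " + codePoints(decodeAsUtf8(DECOMPOSED_A_UMLAUT)));
    }
}
```

Decoded as ASCII, the two non-ASCII bytes each become the replacement character U+FFFD, which would match the two unreadable characters in 'Gefa��hrlichkeit.docx'. A name containing U+FFFD cannot be encoded back to the original bytes, so exists() would end up asking for a different file name.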

Dirk27

This is not an answer as such, but it allows me to show the problem with Unicode composition and file names. Let's create two files with the same name:

goose@t410:/tmp$ touch $(echo -e '\x61\xCC\x88.txt')
goose@t410:/tmp$ touch $(echo -e '\xC3\xA4.txt')
goose@t410:/tmp$ ls *.txt
ä.txt  ä.txt

What!? Hang on, this is a trick isn't it? They are really the same file? Here's proof they are different:

goose@t410:/tmp$ ls -i *.txt

131467 ä.txt 131527 ä.txt
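The same effect can be shown on the Java side with java.text.Normalizer. The following is a minimal sketch, not a tested fix for the application: the class NormalizedLookup and its methods are made up for illustration, and it assumes the JVM already decodes file names correctly as UTF-8.

```java
import java.text.Normalizer;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public class NormalizedLookup {
    // Brings a name into composed form (NFC), e.g. 'a' + U+0308 becomes U+00E4.
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // Returns the on-disk name that equals `wanted` once both are NFC-normalized.
    static Optional<String> findName(List<String> namesOnDisk, String wanted) {
        String target = nfc(wanted);
        return namesOnDisk.stream()
                .filter(name -> nfc(name).equals(target))
                .findFirst();
    }

    public static void main(String[] args) {
        String decomposed = "Gefa\u0308hrlichkeit.docx"; // as it might be stored on disk
        String composed = "Gef\u00e4hrlichkeit.docx";    // as it might come from the database
        System.out.println("equal as-is: " + decomposed.equals(composed));
        System.out.println("equal after NFC: " + nfc(decomposed).equals(composed));

        List<String> names = Arrays.asList("Verderblichkeit.docx", decomposed);
        System.out.println("found: " + findName(names, composed).isPresent());
    }
}
```

In an application, the list of names could come from File.list() on the directory, and the returned on-disk name would then be used to open the file instead of the name from the database.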

g00se
  • You are right, as far as I can see. We solved this problem by setting the system properties and locales the way each env needed. I posted an answer. – Dirk27 Jan 06 '23 at 16:24
  • I think it can get even weirder. It might be possible to have a file called ä.txt with its name encoded as, say, ISO-8859-1 and have its name read identically if you have more than one locale installed. I say "I think" as I haven't tested this as I don't want to mess with my en_GB.UTF-8 locale. But I *know* I could make a file with that name in ISO-8859-1. My doubt is only whether it would also read correctly – g00se Jan 06 '23 at 16:39
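The byte-level side of that last comment can be illustrated without touching any locale. This sketch only compares encodings in memory, it does not create files, and the class name Latin1NameDemo is made up for illustration.

```java
import java.nio.charset.StandardCharsets;

public class Latin1NameDemo {
    // Composed a-umlaut (U+00E4) followed by ".txt".
    static final String NAME = "\u00e4.txt";

    // The name as it would be stored by a process using ISO-8859-1.
    static byte[] latin1Bytes() {
        return NAME.getBytes(StandardCharsets.ISO_8859_1);
    }

    // The name as it would be stored by a process using UTF-8.
    static byte[] utf8Bytes() {
        return NAME.getBytes(StandardCharsets.UTF_8);
    }

    // The ISO-8859-1 bytes as seen by a reader that expects UTF-8.
    static String latin1ReadAsUtf8() {
        return new String(latin1Bytes(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println("ISO-8859-1 length: " + latin1Bytes().length);
        System.out.println("UTF-8 length: " + utf8Bytes().length);
        System.out.println("round trip intact: " + latin1ReadAsUtf8().equals(NAME));
    }
}
```

The two encodings produce different byte sequences for the same visible name, and the ISO-8859-1 bytes are not valid UTF-8, so a reader expecting UTF-8 gets a replacement character instead of the umlaut.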