My Java program reads the contents of a directory recursively. This is a sample tree (note the non-ASCII characters):

./sviluppo
./sviluppo/ciaò
./sviluppo/ciaò/subdir
./sviluppo/pippo
./sviluppo/pippo/prova2.txt <-file
./sviluppo/così

The program is started as an Upstart service, with a configuration file at /etc/init/myservice.conf:

description "Private Service"
author "AD"
start on runlevel [2345]
stop on runlevel [! 2345]
exec java -jar /home/mainFind.jar >> /tmp/log.txt

When I launch the service:

root@mdr:/tmp#  service myservice start
myservice start/running, process 15344

it doesn't log filenames with non-ASCII characters in the name:

root@mdr:/tmp#  cat /tmp/log.txt
Found dir: /mnt/sviluppo/pippo

Instead, when I run the command (as root, to mimic what happens when it's started as a service) it works fine, with and without exec:

root@mdr:/tmp# java -jar /home/mainFind.jar  >> /tmp/log.txt
root@mdr:/tmp# exec java -jar /home/mainFind.jar  >> /tmp/log.txt

root@mdr:/tmp#  cat /tmp/log.txt
Found dir: /mnt/sviluppo/ciaò
Found dir: /mnt/sviluppo/ciaò/subdir
Found dir: /mnt/sviluppo/pippo
Found dir: /mnt/sviluppo/così

Why does the same program, run by the same user, fail in an Upstart service but correctly process all of the filenames when run from the command line? Here is the Java code:

public static void aggiungiFileDir(File f){
  File[] lista = f.listFiles();
  if (lista == null) return; // f is not a directory or could not be read
  for (File entry : lista) {
    if (entry.isDirectory()) {
      System.out.println("Found dir: " + entry);
    }
  }
}

Here the formal parameter f is the root directory. The function is called recursively on each subdirectory.

EDIT 2: Output of ls:

root@mdr:/tmp# ls -al /mnt/sviluppo
totale 20
drwx------ 5 root root 4096 nov 15 15:10 .
drwxr-xr-x 7 root root 4096 nov  9 10:43 ..
drwxr-xr-x 2 root root 4096 nov 15 15:10 ciaò
drwxr-xr-x 2 root root 4096 nov 15 11:23 così
drwxr-xr-x 2 root root 4096 nov 15 17:57 pippo
Raffaele
Andrea
  • Without the code, one can't tell... – Raffaele Nov 16 '12 at 12:43
  • The Java code? I didn't post it because I didn't think it was relevant. The program works when launched from the command line but not as a service, so I think the problem is with Linux, not Java. I have edited the question to include the code. – Andrea Nov 16 '12 at 14:09
  • Edited as requested, but the permissions are OK – Andrea Nov 16 '12 at 15:34
  • Note that the permissions don't seem OK. The current directory doesn't allow any operation to unprivileged users... Anyway, what's the problem? Can you attach a JUnit test case? Or at least compare the actual result with the expected one? From the description, it's not clear what happens here... – Raffaele Nov 16 '12 at 15:35
  • The problem is that if I run the program from the command line (exec java etc.) it correctly returns the list of all directories. If I run "service myservice start", where myservice.conf in /etc/init/ contains the configuration posted above, it does not find the directories with accented names! I said the permissions are OK because both operations are executed by root. – Andrea Nov 16 '12 at 15:40
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/19645/discussion-between-raffaele-and-andrea) – Raffaele Nov 16 '12 at 15:44

2 Answers


Java uses a native call to list the contents of a directory. The underlying C runtime relies on the locale concept to build Java Strings from the byte blob stored by the filesystem as the filename.
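
As a rough illustration of that decoding step, here is a minimal sketch in plain Java (it is not the JVM's actual native code path). It takes the UTF-8 bytes of the directory name ciaò from the question and decodes them with two different charsets. The class name DecodeDemo and the helper decode are hypothetical names chosen only for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeDemo {
    // Hypothetical helper: decode raw filename bytes with the given charset.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // "ciaò" as stored on a UTF-8 filesystem: ò is the two bytes 0xC3 0xB2.
        byte[] raw = {'c', 'i', 'a', (byte) 0xC3, (byte) 0xB2};
        String expected = "cia\u00F2"; // "ciaò", escaped to stay independent of source encoding

        String asUtf8 = decode(raw, StandardCharsets.UTF_8);
        String asAscii = decode(raw, StandardCharsets.US_ASCII);

        System.out.println("Decoded as UTF-8 matches the name:    " + asUtf8.equals(expected));
        System.out.println("Decoded as US-ASCII matches the name: " + asAscii.equals(expected));
    }
}
```

Only the UTF-8 decoding reproduces the original name. With an ASCII charset the accented character is lost, so the resulting String no longer refers to the entry that exists on disk.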

When you execute a Java program from a shell (as either a privileged or an unprivileged user), it inherits an environment made of variables. The LANG variable determines the charset used to decode that stream of bytes into a Java String, and by default on Ubuntu it selects a UTF-8 locale.
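
To check which environment the JVM actually sees, a small diagnostic like the following sketch can be run both from a shell and from the service, and the two outputs compared. The class name LocaleCheck and the helper describe are hypothetical. Note that sun.jnu.encoding is an internal, undocumented system property of OpenJDK-based JVMs, so reading it here is an assumption and it may be null elsewhere. On recent JDKs the default charset may also be UTF-8 regardless of the locale.

```java
import java.nio.charset.Charset;

public class LocaleCheck {
    // Hypothetical helper: summarize the settings that can affect filename decoding.
    static String describe() {
        return "LANG=" + System.getenv("LANG")                  // null if the variable is unset
            + ", default charset=" + Charset.defaultCharset()
            + ", sun.jnu.encoding=" + System.getProperty("sun.jnu.encoding"); // internal property, assumption
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

If LANG prints as null under the service but as a UTF-8 locale in the shell, the difference in behaviour comes from the environment rather than from the program.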

Note that a process need not be run from a shell, but judging from its source code, Upstart is smart enough to recognize when the command in the configuration file is meant to be executed through one. So, assuming the JVM is invoked through a shell, the problem is that the LANG variable is not set, so the C runtime falls back to a default charset, which happens not to be UTF-8. The solution is to set LANG with an env stanza in the Upstart job:

description "List UTF-8 encoded filenames"
author "Raffaele Sgarro"
env LANG=en_US.UTF-8
script
  cd /workspace
  java -jar list.jar test > log.txt
end script

I used en_US.UTF-8 as the locale, but any UTF-8 backed one will do just as well. Here is the source of the test list.jar:

public static void main(String[] args) {
    for (File file : new File(args[0]).listFiles()) {
        System.out.println(file.getName());
    }
}

The directory /workspace/test contains filenames like ààà, èèè and so on. Now you can move to the database part ;)

Raffaele

Adding this to /etc/init.d/script fixed this issue for me (I copied it from /etc/init.d/tomcat7):

# Make sure script is started with system locale
if [ -r /etc/default/locale ]; then
    . /etc/default/locale
    export LANG
fi

Contents of /etc/default/locale on my machine:

LANGUAGE=en_US:en
LANG=en_US.UTF-8
minisaurus