
I am trying to pass a UTF-8 string as a command line parameter from PHP to a Java program. When I view the string in the PHP debugger, it shows correctly: Présentation

Yet when I look at the args[0] data in the Java debugger (and at the returned value passed back to the PHP program) I see: Pr??sentation

I have tried the Java code below, and neither ISO_8859_1 nor UTF_8 returns the proper result.

I've looked here on stackoverflow (Translate UTF-8 character encoding function from PHP to Java) as well as on other sites and still cannot make sense of what I am doing wrong.

Everything seems to work fine in PHP, yet Java appears to receive the data already altered from the start, so perhaps it needs additional processing before or after I call the code below.

This is my first go at dealing with international characters. Any help is greatly appreciated. Thank you!

Edit: I am debugging remotely from Windows; the PHP and Java both run on an Ubuntu system. Since the PHP code and the Java code it calls both reside on the Linux system, the Windows command line problems with Java and UTF-8 should not apply. I had read here on stackoverflow that those were an issue for some in the recent past.

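        // Printing a byte[] directly shows only its identity (something like [B@1b6d3586), so dump the byte values instead.
        byte[] test_str_1 = args[0].getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(java.util.Arrays.toString(test_str_1));
        byte[] test_str_2 = args[0].getBytes(StandardCharsets.UTF_8);
        System.out.println(java.util.Arrays.toString(test_str_2));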
Michael
  • A more common approach is to have the Java application running a web API of some kind and avoid shell invokes given there are a lot of security pitfalls that could make your system really vulnerable. At the least though I think we'd need to see the PHP - the code snippet there isn't very helpful :) – Luke Briggs Jul 19 '21 at 02:57
  • “Yet when I look at” and “I am debugging on Windows remotely”. Giant red flags there. Many terminals, especially Windows, have a hard time with encodings. Make sure you actually have a problem before you fix it. – Chris Haas Jul 19 '21 at 03:10
  • Thank you for the input. Firstly, I should mention that this is for a private Intranet application that is limited in scope. There is no access from the web and security is not an issue. The PHP code is not at issue and could be any calling application (in this case, the string sent is indeed a UTF-8 string as presented above). I can cut and paste that string into the Linux command line and I get the same results. The only missing code is where I receive the args[0] string, and that is why I only included the relevant code above. – Michael Jul 19 '21 at 04:33
  • Regarding the Windows application - at first I did think this may be the problem. But the string is fine in the PHP debugger, and both that and IntelliJ are similarly designed and run together on the same machine. So (and just a guess) I would think both JetBrains programs would work the same (indeed, both work well in hitting breakpoints between the two programs, and both show other variables properly). It seems that it is something to do with how to run a conversion once in Java. – Michael Jul 19 '21 at 04:37
  • what locale is your Ubuntu machine set to? It may be the locale of your shell that is causing the encoding issues. One option is to write to stdin which appears as a raw byte stream in Java and you can explicitly use a utf8 reader on that instead. – Luke Briggs Jul 19 '21 at 04:41
  • When I run locale it returns "LANG=en_US.UTF-8" – Michael Jul 19 '21 at 04:46
  • My fix may be to use a temp file, but I would like to keep to command line parms if at all possible due to interoperability of future programs. So restructuring the input method would be a fallback last resort. I am hoping that there is an extra step I am missing in bringing the data in from the command line that does not split the char é into the two ?? chars. It appears that it is expanding the é into two bytes of Unicode (which, from my basic understanding of Unicode would be correct). – Michael Jul 19 '21 at 04:56
  • "Yet when I look at the arg[0] data in the Java debugger (and the returned value passed back to the PHP program) I see: Pr??sentation" Has that debugging environment got its source encoding set to UTF-8 (assuming what's going in IS actually UTF-8)? – g00se Jul 19 '21 at 09:10
  • The data that is output from the Java program and stored on the linux host is the same: Pr??sentation. So the debugger is totally out of the loop at that point, since the stored data is not sent from the host. A copy is returned to the calling PHP program and that is where I was checking in the debugger. The only time it is correct is when the caller sends the command string to the Java code. The help here on stackoverflow got me thinking - so I checked the PHP locale and mbstring settings and all are default UTF-8. So far all great tips from the users here. Tx! – Michael Jul 19 '21 at 13:41
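
The stdin suggestion from the comments could look roughly like the sketch below. It is only an illustration of that idea, not code from the actual program: the class name StdinUtf8 and the helper readFirstLine are made up for the example, and it assumes the caller writes the text as UTF-8 followed by a newline. Because the reader is given an explicit charset, the result should not depend on the locale the JVM was started under.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical example class; the name is not from the original program.
public class StdinUtf8 {

    // Reads the first line of the stream, decoding the raw bytes explicitly as UTF-8
    // instead of relying on the platform default charset.
    public static String readFirstLine(InputStream in) {
        try {
            BufferedReader reader =
                    new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
            String line = reader.readLine();
            return line == null ? "" : line;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = readFirstLine(System.in);
        // Print code points in hex so the check does not depend on the terminal's encoding.
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        System.out.println(sb.toString().trim());
    }
}
```

If the string arrives intact, the é should show up as a single U+00E9 in the output rather than as two separate characters.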

1 Answer


The problem has been solved using the solution provided here:

Unicode to PHP exec

Everyone's help got me on the right track. It was indeed a locale issue, but not at the OS level. Instead it was with PHP's locale.

Another user had a similar issue, and it was fixed by adding the following code to the PHP script before executing the command line that calls the Java program:

$locale = 'en_US.utf-8';
setlocale(LC_ALL, $locale);
putenv('LC_ALL='.$locale);

So now, in the Java code, when I view the args[0] param, it is displayed correctly, and the processed text is also stored in a file correctly and then sent back to and received by the PHP script properly. It took a bit of looking up the byte values, the corresponding UTF-8 encodings, and the like before I could start to see that the issue was that PHP was translating what was a correct string just before exec() into a different string during the exec() call. During this call the UTF-8 bytes 0xC3 0xA9 for "é" (Unicode \u00E9) were converted into 0x3F 0x3F (two ASCII question mark chars).
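
For anyone who wants to see the byte values for themselves, here is a minimal sketch of how a two-byte UTF-8 "é" can end up as two question marks when the bytes pass through an ASCII-only charset. This is only an illustration of the kind of substitution I saw; it does not establish exactly which step of the exec() call did it in my setup. The class name AsciiRoundTrip and its helper methods are made up for the example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical example class; the name is not from the original program.
public class AsciiRoundTrip {

    // Formats bytes as lowercase hex pairs separated by spaces.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Encodes the text as UTF-8, then decodes and re-encodes it as US-ASCII,
    // imitating a process that treats the bytes as plain ASCII.
    public static String roundTripThroughAscii(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        // Every byte >= 0x80 is invalid ASCII and decodes to U+FFFD.
        String decoded = new String(utf8, StandardCharsets.US_ASCII);
        // U+FFFD cannot be encoded in ASCII, so each one becomes '?' (0x3f).
        byte[] ascii = decoded.getBytes(StandardCharsets.US_ASCII);
        return new String(ascii, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // \u00E9 is used instead of a literal é so the source file encoding does not matter.
        System.out.println(hex("\u00E9".getBytes(StandardCharsets.UTF_8)));   // prints "c3 a9"
        System.out.println(hex(roundTripThroughAscii("\u00E9")
                .getBytes(StandardCharsets.US_ASCII)));                        // prints "3f 3f"
        System.out.println(roundTripThroughAscii("Pr\u00E9sentation"));       // prints "Pr??sentation"
    }
}
```

The one character became two question marks because it occupies two bytes in UTF-8, and each byte was replaced on its own.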

During my searching here on stackoverflow I saw a warning not to use literals (e.g. "Présentation"), and once I backtracked the data to the caller it became evident that the issue involved the actual call to exec().

Hopefully someone else new to Unicode processing can benefit from this information.

Thanks for everyone's input which pointed me in the right direction.

Michael