
I am writing a NetBeans Java program that handles Chinese characters in its arguments. I can use Unicode escapes and Chinese characters in my source code, and they compile and display on the console correctly. However, when I pass Chinese characters via the run arguments in the project properties, they all turn into ?????. I have already set my project encoding to UTF-8 and the VM options to -Dfile.encoding=UTF-8. Here is my code; please help.

public static void main(String[] args) {
    String test = "\u5973\u58eb";
    System.out.println(test);      //works
    String test2 = "女士2";
    System.out.println(test2);     //works
    System.out.println(args[0]);   // argument copied and pasted from test2; does not work, shows ??
}

NB: I use JDK 1.8.0_202, Ant 1.10.4, Windows 10 and NetBeans 10. The font used for the NetBeans output and terminal is Monospaced 14, and the project was created using the NetBeans "Java Application" template. The problem occurs both when running within NetBeans and from the Windows command prompt with chcp 65001, in both cases with -Dfile.encoding=UTF-8.

NB: Following skomisa's advice, I tested further with the Windows setting "Use Unicode UTF-8 for worldwide language support" checked. I also moved away from NetBeans and ran the jar from Windows cmd with chcp 65001.

D:\.......>java -cp dist/myjar.jar -Dfile.encoding=UTF8 java.mypackage.TestUTF8 女士
女士
女士2
女士

D:\.......>java -cp dist/myjar.jar java.mypackage.TestUTF8 女士
??
??2
女士

So with -Dfile.encoding=UTF8 (first run), the string constants in the code work but the arguments do not. Without -Dfile.encoding (second run), the in-code constants do not work but the arguments do. I need both: I have Chinese string constants in my program as well as Chinese text in the program arguments. Can somebody tell me what can be done, please?
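A small diagnostic can show whether the argument is already damaged when it reaches main() or only when it is printed. The sketch below is not part of the original program: the class name ArgDiagnostics and the helper toCodePoints are made up for illustration, and sun.jnu.encoding is an internal, implementation-specific property that may be absent on some JVMs. If args[0] prints as U+003F U+003F U+0032, the Chinese characters were replaced by question marks before main() ran; if it prints as U+5973 U+58EB U+0032, the string is intact and only the console output is wrong.

```java
import java.nio.charset.Charset;

final class ArgDiagnostics {

    // Renders every code point as U+XXXX so the result does not depend on
    // how the console draws non-ASCII characters.
    static String toCodePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Encoding-related settings the JVM is actually running with.
        System.out.println("file.encoding    = " + System.getProperty("file.encoding"));
        // Internal property; assumed to be present on this JVM.
        System.out.println("sun.jnu.encoding = " + System.getProperty("sun.jnu.encoding"));
        System.out.println("default charset  = " + Charset.defaultCharset());

        for (int i = 0; i < args.length; i++) {
            System.out.println("args[" + i + "] = " + toCodePoints(args[i]));
        }
    }
}
```

Running it once with and once without -Dfile.encoding=UTF-8 makes the two cases above directly comparable.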

senderj
  • [1] How are you creating your project: with Ant, Maven or Gradle? That might be relevant. [2] I reproduced your problem (using NB 12.6 + JDK 11 + Ant), but have no solution. [3] See SO question [Java, Unicode, UTF-8, and Windows Command Prompt](https://stackoverflow.com/q/11927518/2985643). Arguably your question is a duplicate of that, but since it was raised over 9 years ago, and was not satisfactorily resolved, I think it's worth revisiting the issue... – skomisa Jan 01 '22 at 06:53
  • ... [4] That question prompted someone to create Java bug report [JDK-8124977 cmdline encoding challenges on Windows](https://bugs.openjdk.java.net/browse/JDK-8124977). That bug report is still unresolved (!!!), so I doubt if there is a true solution for your problem, but there may be some workaround(s). [5] Please update your question with your versions of Java and NetBeans, your O/S, details on how you created the project, and the font you are using to render the Chinese characters. This will allow others to more closely replicate your environment. – skomisa Jan 01 '22 at 06:56
  • Thank you skomisa for the links. My case is different. The linked question is about file encoding; mine is about argument encoding, even with chcp 65001. Using the same example as in the link, "java -jar Read.jar file.txt", my problem is that "file.txt" is in Chinese (the file name, not the file contents). Reading UTF-8 file contents is no problem for me with the encoding properties set correctly, but Chinese in the file name stops me. – senderj Jan 01 '22 at 22:45
  • Right - the linked question is not an exact duplicate of your question, but the accepted answer to that linked question stated _"you can't reliably use the Windows console for Unicode I/O"_. That answer was from 2012, but the fact that the Java bug I linked to remains unresolved means that your core problem is effectively the same as for the linked question: Unicode input from the Windows command line is broken. Hence my suggestion that it is arguably a duplicate. That said, I'm wondering if there might be a workaround [using JNA](https://github.com/java-native-access/jna#readme). – skomisa Jan 01 '22 at 23:00
  • Thank you skomisa, I've read and understood the second link. Although it helped me make some progress, it created another problem. See my second EDIT. – senderj Jan 02 '22 at 07:01
  • OK. [1] Setting `-Dfile.encoding=UTF-8` won't help because it is ignored. See [JEP 400: UTF-8 by Default](https://openjdk.java.net/jeps/400) which states _"Developers sometimes attempt to configure the default charset by setting the system property file.encoding on the command line (i.e., java -Dfile.encoding=...), **but this has never been supported**"_!!! [2] The JEP describes how UTF-8 will become Java's default charset in JDK 18, but even that won't solve your problem because one of the goals is to _"Standardize on UTF-8 throughout the standard Java APIs, **except for console I/O**"_. – skomisa Jan 02 '22 at 08:20
  • [1] Ignore my previous comment. I have found that setting `-Dfile.encoding=UTF-8` is essential when running from the command prompt. My apologies for misleading you. [2] I have posted an answer that solves the issue using JNA. – skomisa Jan 04 '22 at 01:44

1 Answer

Based on answers to similar questions on SO, it seems that passing Unicode arguments to a Java application has never worked properly. There is no simple solution, but you can resolve this issue using JNA (Java Native Access).

JNA allows you to invoke Windows API functions from Java without writing any native code yourself. So in your Java application you can call Windows API functions such as GetCommandLineW() and CommandLineToArgvW() directly, to access details about the command line used to invoke your program, including any arguments passed. Both of those functions support Unicode.

The code to do this is not trivial, but not overly complex either. The approach below is based on code by Sergey Karpushin in an answer to "Passing command line unicode argument to Java code" (https://stackoverflow.com/a/41923480/2985643).

For the code to compile you will need a couple of jars: jna.jar and jna-platform.jar. You can get these from the dist directory of the JNA 5.10.0 download, or from Maven.

This approach works both within NetBeans and from the command line on Windows 10, though there are some notable differences:

  • From the command line you must call chcp 65001, and also specify -Dfile.encoding=UTF-8 in your java.exe call.
  • When extracting the parameters returned by CommandLineToArgvW() you may see a difference between the arguments returned within NetBeans and those from the command line. But this is not really an issue since the only argument(s) you are interested in are those at the end, which come after the argument containing your jar file name.
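Because the leading entries differ between NetBeans and the command line, one alternative to searching for the jar file name is to take only the trailing entries, using the args.length that main() already received. This is a sketch under an assumption that the original answer does not verify: that the JVM passes main() exactly as many arguments as there are trailing application arguments in the list returned by CommandLineToArgvW(). That held in the sample run (args.length=2 and two params), but it may not hold if the launcher expands wildcards or argument files. The class name TrailingArgs and the method lastN are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

final class TrailingArgs {

    // Returns the last `count` entries of the full command line, where
    // `count` is expected to be args.length as seen by main().
    static String[] lastN(List<String> fullCommandLine, int count) {
        int size = fullCommandLine.size();
        if (count < 0 || count > size) {
            throw new IllegalArgumentException(
                    "count " + count + " out of range for " + size + " entries");
        }
        return fullCommandLine.subList(size - count, size).toArray(new String[0]);
    }

    public static void main(String[] args) {
        // Hard-coded stand-in for the array returned by CommandLineToArgvW().
        List<String> full = Arrays.asList(
                "java", "-Dfile.encoding=UTF-8", "-jar",
                "D:\\NB126\\ChineseArg\\dist\\ChineseArg.jar",
                "\u5973\u58eb2", "\\u5973\\u58eb");
        // Pretend main() received two (garbled) arguments.
        String[] params = lastN(full, 2);
        System.out.println("Number of params=" + params.length);
    }
}
```

This also covers launches that use -cp and a main class name instead of -jar, where the entry after the jar file name is the class name rather than the first argument.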

Here's the code:

package chinesearg;

import com.sun.jna.Native;
import com.sun.jna.Pointer;
import com.sun.jna.WString;
import com.sun.jna.ptr.IntByReference;
import com.sun.jna.win32.StdCallLibrary;
import java.util.ArrayList;
import java.util.List;

// Proof-of-concept application that uses JNA to correctly process command
// line arguments containing Chinese characters.
//
// Credit to Sergey Karpushin for the approach used in this code.
// See this SO answer: https://stackoverflow.com/a/41923480/2985643
public class ChineseArg {

    private final Kernel32 kernel32 = Native.load("kernel32", Kernel32.class);
    private final Shell32 shell32 = Native.load("shell32", Shell32.class);

    public static void main(String[] args) {

        String test = "\u5973\u58eb";
        System.out.println(test);      //works
        String test2 = "女士2";
        System.out.println(test2);     //works
        System.out.println("args.length=" + args.length);
        for (int i = 0; i < args.length; i++) {
            System.out.println("args[" + i + "] = " + args[i]);
        }
        String[] params = new ChineseArg().getCommandLineArguments();
        if (params == null) {
            System.out.println("getCommandLineArguments() returned null.");
        } else {
            int count = params.length;
            System.out.println("Number of params=" + count);
            for (int i = 0; i < count; i++) {
                System.out.println("params[" + i + "]=" + params[i]);
            }
        }
    }

    private String[] getCommandLineArguments() {

        System.out.println("Active code page is " + Kernel32.INSTANCE.GetConsoleCP());
        String[] ret = getFullCommandLine();
        List<String> argsOnly = null;

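        // Keep only the entries that follow the first one ending in ".jar";
        // everything up to and including the jar file name (java.exe, JVM
        // options, -jar) is not an application argument.
        for (int i = 0; i < ret.length; i++) {
            if (argsOnly != null) {
                argsOnly.add(ret[i]);
            } else if (ret[i].toLowerCase().endsWith(".jar")) {
                argsOnly = new ArrayList<>();
            }
        }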
        if (argsOnly != null) {
            ret = argsOnly.toArray(new String[0]);
        }
        return ret;
    }

    private String[] getFullCommandLine() {

        IntByReference argc = new IntByReference();
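        Pointer argv_ptr = shell32.CommandLineToArgvW(kernel32.GetCommandLineW(), argc);
        if (argv_ptr == null) {
            // CommandLineToArgvW returns NULL on failure.
            throw new IllegalStateException("CommandLineToArgvW failed");
        }
        String[] argv = argv_ptr.getWideStringArray(0, argc.getValue());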
        kernel32.LocalFree(argv_ptr);
        return argv;
    }
}

interface Kernel32 extends StdCallLibrary {
    static Kernel32 INSTANCE = Native.load("kernel32", Kernel32.class, com.sun.jna.win32.W32APIOptions.DEFAULT_OPTIONS);
    WString GetCommandLineW();
    int GetConsoleCP();
    Pointer LocalFree(Pointer pointer);
}

interface Shell32 extends StdCallLibrary {
    Pointer CommandLineToArgvW(WString command_line, IntByReference argc);
}

This is sample output when run from the Command Prompt, showing that the first argument ("女士2") is captured correctly:

C:\Users\johndoe>chcp 65001
Active code page: 65001

C:\Users\johndoe>java -Dfile.encoding=UTF-8 -jar "D:\NB126\ChineseArg\dist\ChineseArg.jar" "女士2"  "\u5973\u58eb"
女士
女士2
args.length=2
args[0] = ??2
args[1] = \u5973\u58eb
Active code page is 65001
Number of params=2
params[0]=女士2
params[1]=\u5973\u58eb

C:\Users\johndoe>

Notes:

  • This code is addressing a limitation in the Windows environment. I don't know what would happen if this code was run on macOS or Linux.
  • Although it's against the spirit of your question, there is an alternative approach: pass arguments to the application as escaped Unicode. It's trivial to unescape the data using Apache's StringEscapeUtils.unescapeJava(). If that is feasible there is no need for JNA at all.
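To illustrate the escaped-Unicode alternative without adding a dependency, here is a minimal hand-rolled sketch. It is a stand-in for StringEscapeUtils.unescapeJava(), not a reimplementation of it: it only handles a backslash, the letter u and exactly four hex digits, and leaves everything else (including other escapes such as \n, and doubled backslashes) untouched. The class name UnicodeUnescape and the method unescape are made up for illustration.

```java
final class UnicodeUnescape {

    // Replaces each backslash-u sequence followed by four hex digits with
    // the UTF-16 code unit it denotes; all other text is copied unchanged.
    static String unescape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            if (s.startsWith("\\u", i) && i + 6 <= s.length() && isHex(s, i + 2, i + 6)) {
                out.append((char) Integer.parseInt(s.substring(i + 2, i + 6), 16));
                i += 6;
            } else {
                out.append(s.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    // True if every character in s[from, to) is a hexadecimal digit.
    private static boolean isHex(String s, int from, int to) {
        for (int k = from; k < to; k++) {
            if (Character.digit(s.charAt(k), 16) < 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Each argument is expected to arrive as plain ASCII escapes.
        for (int i = 0; i < args.length; i++) {
            System.out.println("args[" + i + "] = " + unescape(args[i]));
        }
    }
}
```

Since the escaped argument is pure ASCII, it passes through the command line unchanged; in the sample output above, args[1] arrived intact as \u5973\u58eb.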
skomisa