
I'm trying to pass some strings as arguments to a .jar file that I run from the command line on Debian Linux. Some of the strings contain extended ASCII characters such as the copyright symbol or the letter ü.

java -jar someJar_CL.jar arg1 arg2 'Lizenziert für foo © foobar' 

On Windows, using PowerShell, everything works fine and the .jar file runs as expected. On Linux, however, I get the following exception:

java.lang.IllegalArgumentException: U+FFFD ('.notdef') is not available in this font Helvetica encoding: WinAnsiEncoding
        at org.apache.pdfbox.pdmodel.font.PDType1Font.encode(PDType1Font.java:426)
        at org.apache.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:342)
        at org.apache.pdfbox.pdmodel.font.PDFont.getStringWidth(PDFont.java:373)
        at watermark.app.AddWatermarkToFile.watermarkPdf(AddWatermarkToFile.java:101)
        at watermark.app.AddWatermarkToFile.watermarkPdfs(AddWatermarkToFile.java:51)
        at watermark.gui.BatchWatermarkPDFFile.main(BatchWatermarkPDFFile.java:113)

As I understand it, this exception means the program has trouble with the extended ASCII characters. If I remove them, it runs correctly on Linux.

I have no direct access to the source code of the .jar file, but I don't think that's necessary, since it runs correctly on Windows (it all runs on the JRE regardless of OS).

I didn't expect it to be the solution, but I installed the Microsoft fonts with apt-get install msttcorefonts. It didn't change anything.

How can I fix this issue? Does it have anything to do with the Helvetica font? Would a different font work on Linux? I can contact the developer of the .jar to ask for changes, but only if it is really necessary.

Thanks in advance.

dombg
  • Interpretation of the bytes of the command line usually depends on the locale. What locale is set up? Just execute `locale` on your command line to see. On a modern Linux you usually want something that ends in `UTF-8` for maximum compatibility. And as a minor pet-peeve of mine: please get rid of the term "extended ASCII" from your vocabulary, because it's *very misleading* at best and usually just *plain wrong*. "non-ASCII characters" is usually much more accurate and less wrong ;-) – Joachim Sauer Feb 24 '20 at 17:09
  • @RealSkeptic: note that the character PdfBox complains about is U+FFFD, which is the Unicode replacement character, usually used to indicate encoding/decoding errors. That indicates that at some earlier step some bytes were not correctly interpreted, so I don't think this is related to fonts. – Joachim Sauer Feb 24 '20 at 17:15
  • @JoachimSauer: Thanks for your reply. I think I'm currently using the default values? Here is the result for locale: LANG= LANGUAGE= LC_CTYPE="POSIX" LC_NUMERIC="POSIX" LC_TIME="POSIX" LC_COLLATE="POSIX" LC_MONETARY="POSIX" LC_MESSAGES="POSIX" LC_PAPER="POSIX" LC_NAME="POSIX" LC_ADDRESS="POSIX" LC_TELEPHONE="POSIX" LC_MEASUREMENT="POSIX" LC_IDENTIFICATION="POSIX" LC_ALL= And I even wrote non-ascii at first but saw on various posts that the copyright symbol was listed under extended-ascii chars. I'm not sure why this is supposed to be wrong, but that's not really the point^^ – dombg Feb 25 '20 at 09:14
  • @JoachimSauer: After changing the locale to en_US.UTF-8 it worked fine. But I'm not sure if I can make this change on the live server. I read it could change the behaviour of some programs; is that really a problem? – dombg Feb 25 '20 at 09:58
  • It should be sufficient to change the environment variables for just the single invocation of your Java application, and that shouldn't influence anything else on that system. Alternatively you can go even more granular and [set `sun.jnu.encoding`](https://stackoverflow.com/questions/1066845/what-exactly-is-sun-jnu-encoding), which seems to influence how command line arguments are decoded. – Joachim Sauer Feb 25 '20 at 10:19
  • @JoachimSauer: Perfect, thank you! Everything works fine now. If you formulate an answer with the given information I will accept it. – dombg Feb 25 '20 at 10:35

1 Answer


Since PdfBox complains about U+FFFD (the Unicode replacement character), we can safely say that something went wrong before the String was given to the PdfBox library.

The issue seems to be how Java interprets the bytes coming in via the command line (the parameters). On Linux it uses the locale information to decide how to interpret command line parameters, which the OS provides as un-annotated byte strings with no indication of their encoding.
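
To see which character set the current locale implies, the standard `locale charmap` command can help. The snippet below is only a diagnostic sketch; `charmap_for` is a hypothetical helper name, and the exact charset names printed depend on the C library in use.

```shell
# Hypothetical helper: print the charset that a given locale name implies,
# without changing the locale of the calling shell.
charmap_for() {
    env LC_ALL="$1" locale charmap
}

# Charset implied by the environment the shell is running in right now.
locale charmap

# Charset implied by the POSIX locale, i.e. what applies when nothing is configured.
charmap_for POSIX
```

If the first command does not report `UTF-8`, non-ASCII arguments are likely to be decoded incorrectly.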

If you don't have a locale configured, Java can fall back to the POSIX locale and decode the arguments as ASCII, which would explain why the non-ASCII characters end up as U+FFFD. You can fix this in one of two ways:

  1. set up your locale (most directly the LANG environment variable) to a locale that uses UTF-8 encoding.

    You can either do this globally or just for the single invocation of java.

  2. set the sun.jnu.encoding system property to explicitly tell Java how to decode command line arguments.

    This option seems to be poorly documented and not standardized, so it might not work with non-Oracle VMs.
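
Both options can be sketched on the command line as follows. This is a minimal sketch, not a tested recipe for this particular .jar: `run_with_utf8` is a hypothetical helper name, and `en_US.UTF-8` is assumed to be installed (check with `locale -a`; another UTF-8 locale such as `C.UTF-8` may be used where available). The `java` lines are commented out and reuse the command from the question.

```shell
# Hypothetical helper: run any command with a UTF-8 locale for that one
# process only. The calling shell and the rest of the system are unaffected.
# Assumption: en_US.UTF-8 is installed; otherwise substitute another
# UTF-8 locale that `locale -a` lists.
run_with_utf8() {
    env LC_ALL=en_US.UTF-8 "$@"
}

# Option 1: locale override for a single invocation of java.
# run_with_utf8 java -jar someJar_CL.jar arg1 arg2 'Lizenziert für foo © foobar'

# Option 2: the system property. It is poorly documented and some JVMs
# may ignore it, so verify the result before relying on it.
# java -Dsun.jnu.encoding=UTF-8 -jar someJar_CL.jar arg1 arg2 'Lizenziert für foo © foobar'
```

Because the variable is set only in the environment of the child process, option 1 does not require changing the server-wide locale.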

Joachim Sauer