
I'm once again messing around with the Java Native Interface, and I've run into another interesting problem. I'm sending a file path to C via JNI and then doing some I/O. The characters I most often have trouble with are 'äåö'. Here is a short demo program with the exact same problem:

Java:

public class java {

  private static native void printBytes(String text);
  static{
    System.loadLibrary("dll");
  }

  public static void main(String[] args){
    printBytes("C:/Users/ä-å-ö/Documents/Bla.txt");
  }
}

C:

#include <stdio.h>
#include <jni.h>
#include "java.h"

JNIEXPORT void JNICALL Java_java_printBytes(JNIEnv *env, jclass class, jstring text){
  /* GetStringUTFChars returns modified UTF-8 as const char*, or NULL on failure */
  const char* text_input = (*env)->GetStringUTFChars(env, text, NULL);
  if (text_input == NULL) return;
  printf("%s\n",text_input);
  (*env)->ReleaseStringUTFChars(env, text, text_input);
}

Output: C:/Users/├ñ-├Ñ-├Â/Documents/Bla.txt

This is NOT my desired result. I would like it to output the same string as in Java.

Linus

1 Answer


You are dealing with platform-specific character encoding issues. The string coming from Java is UTF-8 (multibyte; strictly, JNI's "modified UTF-8", which is identical for these characters), so 'ä' arrives as the two bytes 0xC3 0xA4. printf passes those bytes through unchanged, but the Windows console does not interpret them as UTF-8 by default: it decodes each byte separately using its active single-byte code page, which is why every non-ASCII character shows up as two wrong ones. On a platform whose terminal expects UTF-8, I would expect your code to work. It works for ASCII characters because they have the same single-byte values in UTF-8. It does not work for characters outside of ASCII.
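The effect can be reproduced in plain Java without any native code. The following is a minimal sketch, not part of the original program: the class and method names are hypothetical, and it assumes the console uses code page 850 (Java charset name IBM850), which is consistent with the output shown in the question and is only available if the runtime ships the extended charsets.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {
    // Encode the text as UTF-8, then decode those same bytes with another
    // charset, which is effectively what a console using that code page does.
    static String misread(String text, String consoleCharset) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName(consoleCharset));
    }

    public static void main(String[] args) {
        // Unicode escapes keep the example independent of the source file encoding.
        String original = "\u00e4-\u00e5-\u00f6"; // ä-å-ö
        // Assumption: the console's active code page is 850.
        String seen = misread(original, "IBM850");
        // Print code points rather than glyphs so the result does not depend
        // on the terminal this demo itself runs in.
        for (char c : seen.toCharArray()) {
            System.out.printf("U+%04X ", (int) c);
        }
        System.out.println();
    }
}
```

Each accented letter becomes U+251C (├) followed by a second character, matching the ├ñ-├Ñ-├Â pattern in the question.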

Basically, you need to either convert your string to wide characters in Java (text.getBytes(Charset.forName("UTF-16LE"))) and pass it to C as a byte array, or convert the multibyte string to wide characters in C after receiving it (MultiByteToWideChar(CP_UTF8, ...)). Then you can use printf("%S") or wprintf(L"%s") to output it.
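A rough sketch of the Java-side option follows. The class and helper names are hypothetical, and the trailing two-byte NUL is an assumption about what the native side wants: it lets the C code treat the array contents as a terminated wchar_t string on Windows, where wchar_t is 16 bits.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class WidePath {
    // Hypothetical helper: encode as UTF-16LE (no byte order mark) and append
    // a two-byte NUL terminator for the native side.
    static byte[] toWideBytes(String text) {
        byte[] utf16 = text.getBytes(StandardCharsets.UTF_16LE);
        // Arrays.copyOf zero-fills the two extra bytes.
        return Arrays.copyOf(utf16, utf16.length + 2);
    }

    public static void main(String[] args) {
        byte[] wide = toWideBytes("C:/Users/\u00e4-\u00e5-\u00f6/Documents/Bla.txt");
        System.out.println(wide.length + " bytes ready to pass as a byte[] argument");
    }
}
```

The native method would then be declared with a byte[] parameter instead of String, and the C side would read the array with GetByteArrayElements before handing it to a wide-character function.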

See "Printing UTF-8 strings with printf - wide vs. multibyte string literals" for more information. Also note that the answer there says you have to set Unicode output mode with _setmode if you want Unicode output on the Windows console.

Also note that I don't believe the string returned by GetStringUTFChars (whose length GetStringUTFLength reports) is guaranteed to have a NUL terminator, but it's been too long.

Graham
  • Thank you Graham, this is a very nice and simple explanation, but I'm using an external library in my real program, and unfortunately it doesn't accept wchar_t. Is there a way to apply this to an ordinary char array in C? That would be absolutely awesome. -Cheers – Linus Feb 27 '14 at 15:02
  • Igg. Um, without knowing the details of the library it's hard to say. Depending on your language needs you might be able to get away with [ISO-8859-1](https://en.wikipedia.org/wiki/ISO/IEC_8859-1). Try it in place of UTF-16LE above. – Graham Feb 27 '14 at 15:59
  • I'm using a library called MatIO, but how do I properly use the MultiByteToWideChar function? I saw [this](http://stackoverflow.com/a/3999597/3013334) post about it, but it was using c++, thanks! – Linus Feb 27 '14 at 16:22
  • Pass `CP_UTF8` and `MB_ERR_INVALID_CHARS` as the first and second parameters to [MultiByteToWideChar](http://msdn.microsoft.com/en-us/library/windows/desktop/dd319072%28v=vs.85%29.aspx). The rest is buffer management. I'd suggest doing it in java though as then you don't have to muck with buffer lengths and allocations. – Graham Feb 27 '14 at 18:00
  • Thanks, I tried this short code: `int size_needed=MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, (char*) bytes, -1, NULL, 0); char[size_needed] str; WideCharToMultiByte(CP_UTF8, MB_ERR_INVALID_CHARS, (char*) bytes, -1, str, size_needed, NULL, NULL);` But it doesn't compile properly. – Linus Feb 27 '14 at 18:21
  • I noticed I had changed the charset from UTF-16LE to ISO-8859-1 and when I tried changing it back, it compiled completely fine. Although the output is INDEED incorrect. All it does is that it prints a few weird characters. – Linus Feb 27 '14 at 18:57
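For a library that only takes an ordinary char array, the ISO-8859-1 suggestion from the comments could look roughly like the sketch below. The class and helper names are hypothetical, and whether the library actually expects ISO-8859-1 (rather than, say, the system's ANSI code page) is an assumption that has to be checked against the library. The encoder is set to fail on characters that have no single-byte form instead of silently replacing them with '?'.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class NarrowPath {
    // Hypothetical helper: encode as ISO-8859-1 with a trailing NUL byte so
    // the C side can use the data directly as a char* string.
    static byte[] toLatin1(String text) throws CharacterCodingException {
        ByteBuffer encoded = StandardCharsets.ISO_8859_1.newEncoder()
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .onMalformedInput(CodingErrorAction.REPORT)
                .encode(CharBuffer.wrap(text));
        int length = encoded.remaining();
        byte[] result = new byte[length + 1]; // last byte stays 0 (NUL)
        encoded.get(result, 0, length);
        return result;
    }

    public static void main(String[] args) throws CharacterCodingException {
        byte[] narrow = toLatin1("C:/Users/\u00e4-\u00e5-\u00f6/Documents/Bla.txt");
        System.out.println(narrow.length + " bytes, one per character plus NUL");
    }
}
```

A path containing characters outside ISO-8859-1 makes toLatin1 throw, which is usually preferable to opening the wrong file.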