How to convert the "Java modified UTF-8" to the regular UTF-8 and back?

Question

I have created a Java wrapper around a native C library and have a question about string encodings. The “Java modified UTF-8” encoding used by Java differs slightly from regular UTF-8, and these differences may cause serious problems: the JNI functions may crash the app when passed regular UTF-8, because it may contain byte sequences that are forbidden in “Java modified UTF-8”. Please see the following topic: What does it mean to say "Java Modified UTF-8 Encoding"?

My question is what is a standard reliable way to convert the “Java modified UTF-8” to the regular UTF-8 and back?

if you can get the bytes, there's an answer here: https://stackoverflow.com/questions/655891/converting-utf-8-to-iso-8859-1-in-java-how-to-keep-it-as-single-byte?rq=1 — pcalkins, Aug 08 '19 at 20:06
Don't use `DataInput`, it is not intended for cross-language data exchange. Your question is an [XY problem](https://meta.stackexchange.com/q/66377/351454), i.e. you've decided to use `DataInput` for reading a string from a C library, and that is not the right choice. Go back a step, describe your real problem, i.e. what is the format of the string data coming from C, then ask: How do I read that in Java? — Andreas, Aug 08 '19 at 20:06
ICU can do it, but it'll have to pass through UTF-16 in the middle. [Documentation](http://icu-project.org/apiref/icu4c/ustring_8h.html#aef59ec61e141905bf7b5970ae21f5dd2). — Shawn, Aug 08 '19 at 20:10
Note that although Java's modified UTF-8 is *almost* identical to UTF-8 for characters assigned to Unicode's basic multilingual plane, it is quite different for other characters. The differences can be expressed very simply, but the actual encoded data for non-BMP characters is completely different between the two. — John Bollinger, Aug 08 '19 at 21:01
Essentially any solution has to pass through UTF-16 (at least in some sense) internally since the "Java-modified" thing is just encoding UTF-16 code units in a UTF-8-like encoding (plus misencoding NUL). — R.. GitHub STOP HELPING ICE, Aug 08 '19 at 21:01

score 4 · Answer 1 · answered Aug 08 '19 at 21:33

My question is what is a standard reliable way to convert the “Java modified UTF-8” to the regular UTF-8 and back?

First, consider whether you really need or want to do that. The only reason I can think of for doing so in the context of wrapping a C library is to use the JNI functions that work with Java Strings in terms of byte arrays encoded in modified UTF-8, but that's neither the only nor the best way to proceed except in rather specific circumstances.

For most cases, I would recommend going directly from UTF-8 to String objects, and getting Java to do most of that work. Simple tools Java provides for that include the constructor String(byte[], String), which initializes a String with data whose encoding you specify, and String.getBytes(String), which gives you the string's character data in the encoding of your choice. Both of these are limited to encodings known to the JVM, but UTF-8 is guaranteed to be among those. You can use those directly from your JNI code, or provide suitable for-purpose wrapper methods for your JNI code to invoke.
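As a minimal sketch of that approach (the class and method names here are hypothetical, and it uses the `Charset` overloads `String(byte[], Charset)` and `String.getBytes(Charset)` instead of the charset-name overloads mentioned above, which avoids the checked `UnsupportedEncodingException`):

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the names are illustrative only.
public class Utf8StringBridge {
    // Decode standard UTF-8 bytes (for example, bytes handed over by native code) into a String.
    // Malformed input is replaced with U+FFFD rather than rejected.
    public static String fromUtf8(byte[] utf8) {
        return new String(utf8, StandardCharsets.UTF_8);
    }

    // Encode a String as standard UTF-8 bytes that can be handed to native code.
    public static byte[] toUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "h\u00e9llo \uD83D\uDE00"; // includes a non-BMP character
        byte[] utf8 = toUtf8(original);
        System.out.println(fromUtf8(utf8).equals(original));
    }
}
```

JNI code could call such wrappers with a `jbyteArray`, so that no modified UTF-8 ever crosses the JNI boundary.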

If you really do want the modified UTF-8 form for its own sake, then your JNI code can obtain it from the corresponding Java string (obtained as summarized above) via the GetStringUTFChars JNI function, and you can go the other way with NewStringUTF. Of course, this makes Java Strings the intermediate form, which is entirely apropos in this case.
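The same "String as the intermediate form" idea can also be sketched in pure Java, without JNI. The example below is an assumption-laden illustration, not part of the answer: it relies on `DataOutputStream.writeUTF` and `DataInputStream.readUTF`, which are documented to use modified UTF-8 with a 2-byte length prefix, so it only works for strings whose encoded form fits in 65535 bytes. The class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class; the names are illustrative only.
public class ModifiedUtf8Bridge {
    // Standard UTF-8 -> String -> modified UTF-8 (without the length prefix).
    public static byte[] standardToModified(byte[] utf8) {
        String s = new String(utf8, StandardCharsets.UTF_8);
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(sink)) {
            out.writeUTF(s); // fails if the encoded form exceeds 65535 bytes
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] framed = sink.toByteArray();
        // writeUTF prepends a 2-byte big-endian length; drop it.
        return Arrays.copyOfRange(framed, 2, framed.length);
    }

    // Modified UTF-8 (without the length prefix) -> String -> standard UTF-8.
    public static byte[] modifiedToStandard(byte[] modified) {
        if (modified.length > 0xFFFF) {
            throw new IllegalArgumentException("too long for readUTF: " + modified.length);
        }
        // Re-create the 2-byte big-endian length prefix that readUTF expects.
        byte[] framed = new byte[modified.length + 2];
        framed[0] = (byte) (modified.length >>> 8);
        framed[1] = (byte) modified.length;
        System.arraycopy(modified, 0, framed, 2, modified.length);
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(framed))) {
            return in.readUTF().getBytes(StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this sketch, the single byte `00` (NUL) becomes `C0 80`, and the four-byte sequence `F0 9F 98 80` (U+1F600) becomes the six bytes `ED A0 BD ED B8 80`, which are the two surrogates encoded separately.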

score 2 · Answer 2 · answered Aug 09 '19 at 00:34

Thanks everyone for your replies! I finally found the answer. The only documented way to do such conversions is to use InputStreamReader and OutputStreamWriter:

In normal usage, the Java programming language supports standard UTF-8 when reading and writing strings through InputStreamReader and OutputStreamWriter (if it is the platform's default character set or as requested by the program).

https://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8

Also, the NewStringUTF JNI function expects modified UTF-8 input, not standard UTF-8. It will crash the app if it receives a forbidden byte sequence, and JNI exception handling can't prevent the crash.

So my second conclusion is that passing String/jstring from JNI to Java or the other way is always a bad idea. Never do that. Perform all of the conversions with the InputStreamReader and OutputStreamWriter on the Java layer and pass the raw byte arrays to/from the JNI.
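A minimal sketch of what that Java-side conversion might look like, assuming the native layer exchanges standard UTF-8 as raw `byte[]` (the class and method names are hypothetical):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the names are illustrative only.
public class Utf8Streams {
    // String -> standard UTF-8 bytes, to be passed to native code as a byte array.
    public static byte[] encode(String s) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(sink, StandardCharsets.UTF_8)) {
            writer.write(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The writer is closed (and therefore flushed) before the bytes are read.
        return sink.toByteArray();
    }

    // Standard UTF-8 bytes received from native code -> String.
    public static String decode(byte[] utf8) {
        StringBuilder result = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(utf8), StandardCharsets.UTF_8)) {
            char[] chunk = new char[1024];
            int n;
            while ((n = reader.read(chunk)) != -1) {
                result.append(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result.toString();
    }
}
```

Unlike modified UTF-8, this path encodes U+0000 as the single byte `00`, so the native side has to track lengths explicitly instead of relying on NUL termination.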

Mike Nakis · Answer 3 · 2021-11-10T09:12:57.510

There is absolutely nothing that can only be achieved by using some library. You can always do it yourself.

Note: class Buffer below just wraps an array of bytes, the same way a String wraps an array of chars.

public static String stringFromBuffer( Buffer buffer )
{
    String result = stringFromBuffer0( buffer );
    assert bufferFromString0( result ).equals( buffer );
    return result;
}

public static Buffer bufferFromString( String s )
{
    Buffer result = bufferFromString0( s );
    assert stringFromBuffer( result ).equals( s );
    return result;
}

private static String stringFromBuffer0( Buffer buffer )
{
    byte[] bytes = buffer.getBytes();
    int end = bytes.length;
    char[] chars = new char[end];
    int t = 0;
    for( int s = 0; s < end; )
    {
        int b1 = bytes[s++] & 0xff;
        assert b1 >> 4 >= 0;
        if( /*b1 >> 4 >= 0 &&*/ b1 >> 4 <= 7 ) /* 0b0xxx_xxxx */
            chars[t++] = (char)b1;
        else if( b1 >> 4 >= 8 && b1 >> 4 <= 11 ) /* 0b10xx_xxxx (continuation byte with no lead byte) */
            throw new MalformedUtf8Exception( s - 1 );
        else if( b1 >> 4 >= 12 && b1 >> 4 <= 13 ) /* 0b110x_xxxx 0b10xx_xxxx */
        {
            /* Explicit throws reject bad input even when assertions are disabled. */
            /* Assumes IncompleteUtf8Exception is unchecked, like MalformedUtf8Exception. */
            if( s >= end ) throw new IncompleteUtf8Exception( s - 1 );
            int b2 = bytes[s++] & 0xff;
            if( (b2 & 0xc0) != 0x80 ) throw new MalformedUtf8Exception( s - 1 );
            chars[t++] = (char)(((b1 & 0x1f) << 6) | (b2 & 0x3f));
        }
        else if( b1 >> 4 == 14 ) /* 0b1110_xxxx 0b10xx_xxxx 0b10xx_xxxx */
        {
            if( s >= end ) throw new IncompleteUtf8Exception( s - 1 );
            int b2 = bytes[s++] & 0xff;
            if( (b2 & 0xc0) != 0x80 ) throw new MalformedUtf8Exception( s - 1 );
            if( s >= end ) throw new IncompleteUtf8Exception( s - 1 );
            int b3 = bytes[s++] & 0xff;
            if( (b3 & 0xc0) != 0x80 ) throw new MalformedUtf8Exception( s - 1 );
            chars[t++] = (char)(((b1 & 0x0f) << 12) | ((b2 & 0x3f) << 6) | (b3 & 0x3f));
        }
        else /* 0b1111_xxxx (four-byte forms do not occur in modified UTF-8) */
            throw new MalformedUtf8Exception( s - 1 );
    }
    return new String( chars, 0, t );
}

private static Buffer bufferFromString0( String s )
{
    char[] chars = s.toCharArray();
    byte[] bytes = new byte[chars.length * 3];
    int p = 0;
    for( char c : chars )
    {
        if( (c >= 1) && (c <= 0x7f) )
            bytes[p++] = (byte)c;
        else if( c > 0x07ff )
        {
            bytes[p++] = (byte)(0xe0 | ((c >> 12) & 0x0f));
            bytes[p++] = (byte)(0x80 | ((c >> 6) & 0x3f));
            bytes[p++] = (byte)(0x80 | (c & 0x3f));
        }
        else
        {
            bytes[p++] = (byte)(0xc0 | ((c >> 6) & 0x1f));
            bytes[p++] = (byte)(0x80 | (c & 0x3f));
        }
    }
    if( p > 0xffff )
        throw new StringTooLongException( p );
    return Buffer.create( bytes, 0, p );
}
