How to figure out which character doesn't map to UTF-8

Question

I maintain a small Java servlet-based webapp that presents forms for input and writes the contents of those forms to MariaDB.

The app runs on a Linux box, although the users visit the webapp from Windows.

Some users paste text into these forms that was copied from MS Word documents, and when that happens, they get internal exceptions like the following:

Caused by: org.mariadb.jdbc.internal.util.dao.QueryException: Incorrect string value: '\xC2\x96 for...' for column 'ssimpact' at row 1

For instance, I tested it with text like the following:

Project – for

Here the dash is a "long dash" (an en dash) from the MS Word document.

I don't think it's possible to convert the wayward characters in this text to the "correct" characters, so I'm trying to figure out how to produce a reasonable error message that shows a substring of the bad text in question, along with the index of the first bad character.

I noticed postings like this: "How to determine if a String contains invalid encoded characters".

I thought this would get me close, but it's not quite working.

I'm trying to use the following method:

private int findUnmappableCharIndex(String entireString) {
    int charIndex;
    for (charIndex = 0; charIndex < entireString.length(); ++ charIndex) {
        String  currentChar   = entireString.substring(charIndex, charIndex + 1);
        CharBuffer  out = CharBuffer.wrap(new char[currentChar.length()]);
        CharsetDecoder  decoder = Charset.forName("utf-8").newDecoder();
        CoderResult result  = decoder.decode(ByteBuffer.wrap(currentChar.getBytes()), out, true);
        if (result.isError() || result.isOverflow() || result.isUnderflow() || result.isMalformed() || result.isUnmappable()) {
            break;
        }
        CoderResult flushResult = decoder.flush(out);
        if (flushResult.isOverflow()) {
            break;
        }
    }
    if (charIndex == entireString.length() + 1) {
        charIndex   = -1;
    }
    return charIndex;
}

This doesn't work. I get "underflow" on the first character, which is a valid character. I'm sure I don't fully understand the decoder mechanism.
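A note on the underflow: according to the CoderResult documentation, underflow only means the input has been fully consumed, so it is the normal outcome of a successful decode and not an error. Separately, the final check compares against length() + 1, which the loop can never reach. Since a Java String is already decoded text, a check that asks whether each character can be encoded into some target charset may be closer to the goal. The following is only a sketch under assumptions: the class name UnmappableCharFinder is hypothetical, and ISO-8859-1 is used purely as a stand-in target charset. UTF-8 itself can encode every valid Java string, so against UTF-8 this check would only ever flag lone surrogates.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not taken from the original post.
class UnmappableCharFinder {

    /**
     * Returns the index of the first character that the target charset
     * cannot encode, or -1 if the whole string is encodable.
     */
    static int findUnmappableCharIndex(String text, Charset target) {
        CharsetEncoder encoder = target.newEncoder();
        int i = 0;
        while (i < text.length()) {
            // Step by code point so surrogate pairs are tested as one unit.
            int codePoint = text.codePointAt(i);
            int len = Character.charCount(codePoint);
            if (!encoder.canEncode(text.subSequence(i, i + len))) {
                return i;
            }
            i += len;
        }
        return -1;
    }

    public static void main(String[] args) {
        String sample = "Project \u2013 for"; // \u2013 is the en dash
        // ISO-8859-1 is an assumed stand-in for a restrictive target charset.
        System.out.println(findUnmappableCharIndex(sample, StandardCharsets.ISO_8859_1));
        System.out.println(findUnmappableCharIndex(sample, StandardCharsets.UTF_8));
    }
}
```

With the sample string, the first call reports the index of the en dash and the second reports -1, which suggests the rejection is not coming from UTF-8 encoding as such.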

`\xC2\x96` is a valid UTF-8 sequence for the control character `U+0096` _Start Of Guarded Area_, and it's not related to an _En Dash_ as such. What happens if you try "_Paste as plain text_" or something similar? - JosefZ, Aug 10 '17 at 20:40
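The byte values in that comment can be checked directly. The sketch below (class name DashBytes and the hex helper are hypothetical) prints the UTF-8 bytes of U+0096 and of the en dash U+2013. It also shows one possible, unconfirmed route to the bytes in the error message: in windows-1252 the en dash is the single byte 0x96, and if that byte were decoded as ISO-8859-1 somewhere along the way it would become U+0096, which UTF-8 then encodes as C2 96.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not taken from the original post.
class DashBytes {

    /** Formats a byte array as uppercase hex, e.g. {0xC2, 0x96} -> "C296". */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The control character named in the comment.
        System.out.println(hex("\u0096".getBytes(StandardCharsets.UTF_8)));

        // A real en dash, encoded as UTF-8.
        System.out.println(hex("\u2013".getBytes(StandardCharsets.UTF_8)));

        // Assumed scenario: the raw byte 0x96 is decoded as ISO-8859-1,
        // then the resulting string is re-encoded as UTF-8.
        String misdecoded = new String(new byte[] {(byte) 0x96}, StandardCharsets.ISO_8859_1);
        System.out.println(hex(misdecoded.getBytes(StandardCharsets.UTF_8)));
    }
}
```

If the first and third lines of output match the bytes in the exception while the second does not, that would point toward a charset mismatch between the browser form submission and the servlet, though the post does not confirm this.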
