This is sort of a variation on previously asked questions, but I am still unable to find an answer, so I'm trying to distill the problem to its core in the hope that there is a solution.
I have a database in which, for historical reasons, certain text entries are not UTF-8. Most are, and all entries made in the last three years are, but some older entries are not.
It is important to find the non-UTF-8 characters so I can either avoid them or convert them to UTF-8 for some XML I'm trying to generate.
The server-side JavaScript I'm using has a ByteBuffer type, so I can treat any piece of text as individual bytes and examine them as needed, without going through the String type, which I understand is problematic in this situation.
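To make that concrete, here is roughly how I can walk the bytes of an entry. (This is only a sketch; the accessor names `length` and `byteAt` are stand-ins for whatever our ByteBuffer actually exposes.)

    // Sketch: scan an entry byte by byte and flag anything non-ASCII,
    // since bytes above 0x7F are where UTF-8 and ISO-8859-1 differ.
    function dumpHighBytes(buffer) {
        for (var i = 0; i < buffer.length; i++) {
            var b = buffer.byteAt(i);   // hypothetical accessor: one raw byte, 0-255
            if (b > 0x7F) {
                console.log("non-ASCII byte 0x" + b.toString(16) + " at offset " + i);
            }
        }
    }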
Is there any check I can run on those bytes to determine whether the text is valid UTF-8?
I've been searching for a couple of months now (;_;) and still have not found an answer. Yet there must be a way of doing it, because XML validators (like the ones in the major browsers) are able to report "encoding errors" when they encounter byte sequences that are not valid UTF-8.
I would just like to know the algorithm for how that is done, so I can try the same sort of test in JavaScript. Once I know which characters are bad, I can convert them from ISO-8859-1 (for example) to UTF-8; I have methods for that.
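For what it's worth, the conversion itself is the easy part, since every ISO-8859-1 byte corresponds to the Unicode code point of the same value. My method amounts to something like this (a sketch, not my exact code):

    // Map one ISO-8859-1 byte to its UTF-8 byte sequence.
    function latin1ByteToUtf8Bytes(b) {
        if (b < 0x80) {
            return [b];                   // ASCII is unchanged in UTF-8
        }
        return [0xC0 | (b >> 6),          // leading byte: 110xxxxx
                0x80 | (b & 0x3F)];       // continuation byte: 10xxxxxx
    }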
I just don't know how to figure out which characters are not valid UTF-8. Again, I understand that the JavaScript String type is problematic in this situation, but, as noted above, I do have a ByteBuffer type that can handle the text on a per-byte basis.
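In other words, the shape of what I'm hoping to write is this, where `isValidUtf8` is precisely the check I'm missing (the other names are just placeholders for code I already have):

    if (isValidUtf8(entryBytes)) {
        emitXml(entryBytes);                          // already valid, use as-is
    } else {
        emitXml(convertLatin1ToUtf8(entryBytes));     // re-encode first
    }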
Thanks for any specific tests people can suggest.
doug