Check if a byte sequence is a valid UTF-8 sequence in JavaScript

Question

Is there a simple way to check whether a string is a valid UTF-8 sequence in JavaScript?

I really do not want to end up with a regular expression like this:

P.S.: I am receiving data from an external API, and sometimes (very rarely, but it happens) it returns data with invalid UTF-8 sequences. Trying to put them into PostgreSQL results in an appropriate error.

I don't think that really makes any sense. A string is a list of characters. UTF-8 is a way of representing characters in a binary format. A string in itself does not have an encoding. — njzk2, Dec 17 '13 at 16:12
unless you are trying to determine if a string can be represented completely using utf-8 encoding ? — njzk2, Dec 17 '13 at 16:12
the only way to check for a valid UTF8 is to check whether or not it contains **invalid** utf8 chars. The regex you linked is an effective, concise and efficient way to perform the check. You can, of course, check against your own dictionary, in a custom tuned way. — PA., Dec 17 '13 at 16:13
I don't know of any built-in method so last time I needed this, I used `text.match(/[\x80-\xFF]+/)` to gather *potential* problems, and checked each match against the UTF-8 specification -- 52 lines of code. Using that regexp is actually a pretty neat, fast, and simple way. — Jongware, Dec 17 '13 at 16:14
I am receiving data from an API and sometimes (very rarely, but it happens) it returns data with invalid UTF-8 sequences. Trying to put them into Postgres results in an appropriate error. — zavg, Dec 17 '13 at 16:14
or are you trying to figure out if a sequence of bytes can be interpreted as a UTF-8 encoded string? — njzk2, Dec 17 '13 at 16:15

score 5 · Accepted Answer · edited Jul 26 '21 at 21:28

UTF-8 is in fact a simple encoding, but still what you are asking can't be done with a one-liner. You have to:

1. Override the Content-Type of the response so that your script receives a byte array and the browser/library is prevented from interpreting the response itself.
2. Loop over the bytes to build characters. Note that UTF-8 is a variable-length encoding, which is why some sequences are invalid.
3. If an invalid octet is found, skip it.
4. If needed, deserialize the JSON/XML/whatever string to a JavaScript object, handling any failures.
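The steps above could be sketched roughly as follows. This is only an illustration, not a definitive implementation: the function names (`fetchBytes`, `decodeUtf8SkippingInvalid`, `parseJsonOrNull`) and the `url` parameter are made up for the example, it assumes a runtime that provides `fetch`, and "skip it" is interpreted here as dropping the offending lead byte and resynchronizing on the next one.

```javascript
// Step 1 (assumption: the runtime provides fetch): read the body as raw bytes
// so that nothing decodes it on our behalf. `url` is a placeholder.
async function fetchBytes(url) {
  const response = await fetch(url);
  return new Uint8Array(await response.arrayBuffer());
}

// Steps 2-3: walk the bytes, build characters, skip octets that do not fit.
function decodeUtf8SkippingInvalid(bytes) {
  let out = "";
  let i = 0;
  while (i < bytes.length) {
    const lead = bytes[i];
    let extra = 0; // continuation bytes expected after the lead byte
    let min = 0;   // smallest code point allowed for this sequence length
    let cp = lead; // code point being assembled

    if (lead >= 0x80) {
      if ((lead & 0xe0) === 0xc0) { extra = 1; min = 0x80; cp = lead & 0x1f; }
      else if ((lead & 0xf0) === 0xe0) { extra = 2; min = 0x800; cp = lead & 0x0f; }
      else if ((lead & 0xf8) === 0xf0) { extra = 3; min = 0x10000; cp = lead & 0x07; }
      else { i += 1; continue; } // stray continuation or impossible lead byte
    }

    let ok = i + extra < bytes.length; // false if the sequence is truncated
    for (let k = 1; ok && k <= extra; k++) {
      const next = bytes[i + k];
      if ((next & 0xc0) === 0x80) cp = (cp << 6) | (next & 0x3f);
      else ok = false;
    }
    // Reject overlong forms, surrogates, and values past U+10FFFF.
    if (ok && (cp < min || cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff))) ok = false;

    if (!ok) { i += 1; continue; } // drop only the lead byte, then resynchronize
    out += String.fromCodePoint(cp);
    i += extra + 1;
  }
  return out;
}

// Step 4: parse, treating a parse failure as "no usable payload".
function parseJsonOrNull(text) {
  try { return JSON.parse(text); } catch (err) { return null; }
}
```

With these helpers, something like `parseJsonOrNull(decodeUtf8SkippingInvalid(await fetchBytes(url)))` would yield either an object built from the cleaned-up text or `null`. Whether silently dropping bytes is acceptable depends on the data; rejecting the whole payload may be the safer choice.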

Deciding whether a given byte array is a valid UTF-8 sequence is quite a straightforward task (just a bunch of if statements and bit shifts), but again, it's not a one-line thing.
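As a minimal sketch of that "bunch of if statements and bit shifts", a strict check might look like the following. The name `isValidUtf8` is hypothetical, and the input is assumed to be a `Uint8Array` (or a plain array of numbers in the range 0..255). Beyond the basic lead/continuation structure, it also rejects overlong encodings, UTF-16 surrogates, and code points above U+10FFFF, which well-formed UTF-8 excludes.

```javascript
// Returns true if `bytes` is well-formed UTF-8, false otherwise.
function isValidUtf8(bytes) {
  let i = 0;
  while (i < bytes.length) {
    const lead = bytes[i];
    let extra; // continuation bytes expected after the lead byte
    let min;   // smallest code point allowed for this sequence length
    let cp;    // code point being assembled

    if ((lead & 0x80) === 0x00) {        // 0xxxxxxx: ASCII
      i += 1;
      continue;
    } else if ((lead & 0xe0) === 0xc0) { // 110xxxxx: 2-byte sequence
      extra = 1; min = 0x80; cp = lead & 0x1f;
    } else if ((lead & 0xf0) === 0xe0) { // 1110xxxx: 3-byte sequence
      extra = 2; min = 0x800; cp = lead & 0x0f;
    } else if ((lead & 0xf8) === 0xf0) { // 11110xxx: 4-byte sequence
      extra = 3; min = 0x10000; cp = lead & 0x07;
    } else {
      return false; // stray continuation byte, or lead byte 0xF8..0xFF
    }

    if (i + extra >= bytes.length) return false; // sequence is truncated

    for (let k = 1; k <= extra; k++) {
      const next = bytes[i + k];
      if ((next & 0xc0) !== 0x80) return false;  // must look like 10xxxxxx
      cp = (cp << 6) | (next & 0x3f);
    }

    if (cp < min) return false;                     // overlong encoding
    if (cp > 0x10ffff) return false;                // beyond the Unicode range
    if (cp >= 0xd800 && cp <= 0xdfff) return false; // UTF-16 surrogate

    i += extra + 1;
  }
  return true;
}
```

A check like this could run on the raw response bytes before anything is sent to the database, so that a bad payload is rejected or repaired early instead of surfacing as an insert error.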
