
Is there a simple way to check whether a string is a valid UTF-8 sequence in JavaScript?

I really do not want to end up with a regular expression like this:

Regex to detect invalid UTF-8 string

P.S.: I am receiving data from an external API, and sometimes (very rarely, but it happens) it returns data containing invalid UTF-8 sequences. Trying to insert that data into PostgreSQL results in an appropriate error.

Peter Mortensen
zavg
  • I don't think that really makes any sense. A string is a list of characters. UTF-8 is a way of representing characters in a binary format. A string in itself does not have an encoding. – njzk2 Dec 17 '13 at 16:12
  • unless you are trying to determine if a string can be represented completely using utf-8 encoding ? – njzk2 Dec 17 '13 at 16:12
  • the only way to check for a valid UTF8 is to check whether or not it contains **invalid** utf8 chars. The regex you linked is an effective, concise and efficient way to perform the check. You can, of course, check against your own dictionary, in a custom tuned way. – PA. Dec 17 '13 at 16:13
  • I don't know of any built-in method so last time I needed this, I used `text.match(/[\x80-\xFF]+/)` to gather *potential* problems, and checked each match against the UTF-8 specification -- 52 lines of code. Using that regexp is actually a pretty neat, fast, and simple way. – Jongware Dec 17 '13 at 16:14
  • I am receiving data from an API and sometimes (very rarely, but it happens) it returns data with invalid UTF-8 sequences. Trying to put them into Postgres results in an appropriate error. – zavg Dec 17 '13 at 16:14
  • or are you trying to figure out whether a sequence of bytes can be interpreted as a UTF-8 encoded string? – njzk2 Dec 17 '13 at 16:15

1 Answer


UTF-8 is in fact a simple encoding, but what you are asking for still can't be done with a one-liner. You have to:

  1. Override the Content-Type of the response so that your script receives a byte array, preventing the browser/library from interpreting the response itself
  2. Loop over the bytes to build characters. Note that UTF-8 is a variable-length encoding, which is why some byte sequences are invalid.
  3. If an invalid octet is found, skip it
  4. If needed, deserialize the JSON/XML/whatever string into a JavaScript object, handling failures where appropriate
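
As a rough illustration of steps 2 to 4, here is a minimal sketch. It swaps the hand-written byte loop for the built-in `TextDecoder`, and it assumes an environment that provides it (modern browsers, recent Node.js). Note the difference from step 3: `TextDecoder` does not skip a malformed sequence, it replaces it with U+FFFD. The names `decodeLenient` and `parseApiPayload` are hypothetical, and the payload is assumed to be JSON.

```javascript
// Step 1 is not executed here. With fetch, the raw bytes could be obtained
// roughly like this (url is a placeholder):
//   const bytes = new Uint8Array(await (await fetch(url)).arrayBuffer());

// Steps 2-3: turn bytes into a string and report whether anything was malformed.
function decodeLenient(bytes) {
  try {
    // fatal: true makes decode() throw a TypeError on any invalid sequence.
    const text = new TextDecoder('utf-8', { fatal: true }).decode(bytes);
    return { text, hadInvalid: false };
  } catch (e) {
    // Default (non-fatal) mode substitutes U+FFFD for each malformed sequence.
    const text = new TextDecoder('utf-8').decode(bytes);
    return { text, hadInvalid: true };
  }
}

// Step 4: deserialize, reporting a parse failure instead of throwing.
function parseApiPayload(bytes) {
  const { text, hadInvalid } = decodeLenient(bytes);
  try {
    return { value: JSON.parse(text), hadInvalid, error: null };
  } catch (error) {
    return { value: null, hadInvalid, error };
  }
}
```

The `hadInvalid` flag lets the caller decide whether to store the repaired text or reject the record before it reaches the database.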

Deciding whether a given byte array is a valid UTF-8 sequence is quite a straightforward task (just a bunch of if statements and bit shifts), but again, it's not a one-line thing.
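
To give an idea of what those if statements and bit operations might look like, here is a minimal sketch of a validator. The function name `isValidUtf8` is hypothetical, and the input is assumed to be a `Uint8Array` (or any array of byte values). It rejects stray or missing continuation bytes, overlong forms, surrogate code points, and values above U+10FFFF.

```javascript
function isValidUtf8(bytes) {
  let i = 0;
  while (i < bytes.length) {
    const lead = bytes[i];
    let extra;      // number of continuation bytes that must follow
    let min;        // smallest code point allowed for this length (rejects overlong forms)
    let codePoint;  // payload bits collected so far

    if (lead <= 0x7F) {                    // 0xxxxxxx: single-byte ASCII
      i += 1;
      continue;
    } else if ((lead & 0xE0) === 0xC0) {   // 110xxxxx: 2-byte sequence
      extra = 1; min = 0x80; codePoint = lead & 0x1F;
    } else if ((lead & 0xF0) === 0xE0) {   // 1110xxxx: 3-byte sequence
      extra = 2; min = 0x800; codePoint = lead & 0x0F;
    } else if ((lead & 0xF8) === 0xF0) {   // 11110xxx: 4-byte sequence
      extra = 3; min = 0x10000; codePoint = lead & 0x07;
    } else {
      return false;                        // stray continuation byte or 0xF8-0xFF
    }

    if (i + extra >= bytes.length) return false;  // sequence is truncated

    for (let k = 1; k <= extra; k++) {
      const b = bytes[i + k];
      if ((b & 0xC0) !== 0x80) return false;      // must look like 10xxxxxx
      codePoint = (codePoint << 6) | (b & 0x3F);
    }

    if (codePoint < min) return false;                            // overlong encoding
    if (codePoint > 0x10FFFF) return false;                       // beyond Unicode range
    if (codePoint >= 0xD800 && codePoint <= 0xDFFF) return false; // surrogate

    i += extra + 1;
  }
  return true;
}
```

A caller could run this on the raw response bytes and reject or repair the payload before attempting the database insert.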

Peter Mortensen
Raffaele