4

Is it possible to create an invalid UTF8 string using Javascript?

Every solution I've found relies on String.fromCharCode, which generates undefined rather than an invalid string. I've seen mention of errors being generated by ill-formed UTF8 strings (e.g. https://developer.mozilla.org/en-US/docs/Web/API/WebSocket#send()) but I can't figure out how you would actually create one.

Mattia
  • The error mentioned there is not about UTF-8 strings, and javascript typically does not use UTF-8 to represent strings internally. – pvg Sep 11 '17 at 01:20
  • @pvg: Thanks for pointing out the mistake. Not sure why I assumed UTF8 was the javascript encoding. My question should have been more specific: How can you create a string that contains unpaired surrogates? – Mattia Sep 11 '17 at 09:37
  • I'm not entirely sure and the docs seem pretty vague although it's possible to reach into the bowels of javascript strings and do a lot of strange things without things instantly catching fire. https://i.imgur.com/sWVE0IY.png – pvg Sep 11 '17 at 12:14

2 Answers

4

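One way to generate an ill-formed string in JavaScript is to take an emoji and remove its last UTF-16 code unit, which leaves an unpaired surrogate. (JavaScript strings are sequences of UTF-16 code units, so strictly speaking the result is ill-formed UTF-16 rather than invalid UTF-8.)

For example, this produces an ill-formed string: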

const invalidUtf8 = ''.substr(0,5);
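
As a rough sketch of the same idea, assuming the bicycle emoji (U+1F6B2) as a stand-in for whichever emoji you pick: string indices in JavaScript count UTF-16 code units rather than bytes, so cutting an emoji after its first code unit leaves an unpaired high surrogate. The variable names are only illustrative.

```javascript
// "\uD83D\uDEB2" is the bicycle emoji (U+1F6B2), stored as two UTF-16 code units.
const bicycle = "\uD83D\uDEB2";

// Keep only the first code unit: an unpaired high surrogate.
const truncated = bicycle.slice(0, 1);

console.log(bicycle.length);                       // 2
console.log(truncated.length);                     // 1
console.log(truncated.charCodeAt(0).toString(16)); // "d83d"
```

How the truncated string is displayed depends on the environment; many consoles render it as the replacement character.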
laurent
3

A string in JavaScript is a counted sequence of UTF-16 code units. There is an implicit contract that the code units represent Unicode codepoints. Even so, it is possible to represent any sequence of UTF-16 code units—even unpaired surrogates.

I find that String.fromCharCode(0xd801) returns a string that displays as the replacement character (rather than undefined), which seems quite reasonable. Any text function might substitute the replacement character like that but, for efficiency reasons, I'm sure that many text manipulations just pass invalid sequences through unless the manipulation requires interpreting them as codepoints.

The easiest way to create such a string is with a string literal. For example, "\uD83D \uDEB2" or "\uD83D" or "\uDEB2" instead of the valid "\uD83D\uDEB2".
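
To confirm that such literals really hold unpaired surrogates, scanning the code units is enough. This is only a sketch; `hasLoneSurrogate` is a hypothetical helper written for this example, not a built-in.

```javascript
// Hypothetical helper: returns true if str contains a surrogate code unit
// that is not part of a well-formed high/low pair.
function hasLoneSurrogate(str) {
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      // High surrogate: must be followed by a low surrogate.
      const next = str.charCodeAt(i + 1); // NaN at the end of the string
      if (next >= 0xdc00 && next <= 0xdfff) {
        i++; // skip the low half of a valid pair
        continue;
      }
      return true;
    }
    if (unit >= 0xdc00 && unit <= 0xdfff) {
      // Low surrogate with no preceding high surrogate.
      return true;
    }
  }
  return false;
}

console.log(hasLoneSurrogate("\uD83D\uDEB2"));              // false
console.log(hasLoneSurrogate("\uD83D \uDEB2"));             // true
console.log(hasLoneSurrogate("\uD83D"));                    // true
console.log(hasLoneSurrogate(String.fromCharCode(0xd801))); // true
```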

"\uD83D \uDEB2".replace(" ","") actually does return "\uD83D\uDEB2" ("🚲"), but I don't think you should count on anything good coming from a string that isn't a valid UTF-16 encoding of Unicode codepoints.
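
What actually goes wrong depends on the API that has to interpret the string as codepoints. As a sketch using two standard functions: `encodeURIComponent` throws a `URIError` on an unpaired surrogate, while `TextEncoder` silently substitutes U+FFFD when producing UTF-8. The variable names are only illustrative.

```javascript
const lone = "\uD83D"; // unpaired high surrogate

// encodeURIComponent refuses to encode an unpaired surrogate.
let threw = false;
try {
  encodeURIComponent(lone);
} catch (err) {
  threw = err instanceof URIError;
}
console.log(threw); // true

// TextEncoder replaces the unpaired surrogate with U+FFFD,
// whose UTF-8 encoding is the bytes EF BF BD.
const bytes = Array.from(new TextEncoder().encode(lone));
console.log(bytes); // [239, 191, 189]
```

So an ill-formed string can either raise an error or be quietly repaired, depending on which function first needs to convert it.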

Tom Blodget
  • No good coming from it is exactly what I hoped ;) I'm trying to reliably generate an error in a websocket for testing purposes. Unfortunately the string literal you offered gets transformed into two replacement characters separated by a space (in Chrome at least). Thanks for the information though. It was still useful to know. – Mattia Sep 15 '17 at 13:39