
I'm working on a vulnerability scanner for webapps and I ran into a problem I can't seem to solve. Webapps usually use UTF-8 encoding, which uses 1-4 bytes per character. For example, a 4-byte character in UTF-8 encoding would start with a byte of the form "11110xxx", followed by 3 more bytes that look like "10xxxxxx".
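For example, I can see that pattern by encoding an emoji (a 4-byte character) with the standard TextEncoder:

```js
// Encode U+1F600 (😀), a 4-byte character, and inspect its byte pattern.
const bytes = new TextEncoder().encode('\u{1F600}');
console.log([...bytes].map(b => b.toString(2).padStart(8, '0')));
// -> [ '11110000', '10011111', '10011000', '10000000' ]
// The first byte matches 11110xxx, the remaining three match 10xxxxxx.
```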

I was reading more about UTF-8 and found that it can also support 5 and 6 bytes per character: if the character starts with "111110xx" then it's a 5-byte character, and if it starts with "1111110x" then it's a 6-byte character.
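Since no standard encoder will produce such sequences, my assumption is that I would have to assemble the raw bytes by hand, roughly like this (following the 5-byte prefix layout described above; `encodeObsolete5Byte` is just my own helper name):

```js
// Sketch: build the obsolete 5-byte sequence for a code point by hand,
// since standard encoders reject anything beyond 4 bytes.
function encodeObsolete5Byte(cp) {
  return Uint8Array.from([
    0b11111000 | ((cp >> 24) & 0b00000011), // 111110xx
    0b10000000 | ((cp >> 18) & 0b00111111), // 10xxxxxx
    0b10000000 | ((cp >> 12) & 0b00111111), // 10xxxxxx
    0b10000000 | ((cp >>  6) & 0b00111111), // 10xxxxxx
    0b10000000 | ( cp        & 0b00111111), // 10xxxxxx
  ]);
}

console.log(encodeObsolete5Byte(0x200000));
// -> Uint8Array [ 248, 136, 128, 128, 128 ]  (F8 88 80 80 80)
```

Whether bytes like that survive the browser and network layers at all is exactly what I want my scanner to test.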

I want to inject such characters into webapps (via my scanner) and see if they break. I tried using the utf8.js library (found on npm) to create them, but it turns out this library only supports UTF-8 up to 4 bytes per character.
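For reference, this is roughly what I tried (assuming I'm using utf8.js's `encode` function correctly; as far as I can tell it returns a byte string, one character per byte):

```js
const utf8 = require('utf8');

// U+10FFFF is the highest code point a JavaScript string can express.
const encoded = utf8.encode('\u{10FFFF}');
console.log([...encoded].map(c => c.charCodeAt(0).toString(16)));
// -> [ 'f4', '8f', 'bf', 'bf' ]  — still only 4 bytes; I can't get past this
```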

How do I programmatically create a character that uses 5 or 6 bytes with JavaScript?

Isabella
  • JavaScript strings are always sequences of UTF-16 characters. To create a Unicode code point that requires more than 16 bits, you need a [surrogate pair](http://www.russellcottrell.com/greek/utilities/surrogatepaircalculator.htm#info), which is two UTF-16 characters (see the sketch after these comments). There is no such thing as a 5- or 6-byte UTF-8 representation; the old spec was changed about 15 years ago. – Pointy Nov 03 '18 at 12:52
  • [See this other question for more.](https://stackoverflow.com/questions/9533258/what-is-the-maximum-number-of-bytes-for-a-utf-8-encoded-character) – Pointy Nov 03 '18 at 12:53
  • Yes, JavaScript's strings are counted sequences of UTF-16 code units (like in many languages since VB4 and Java). It's a bit confusing to say "UTF-8 character" or "UTF-16 character". UTF-8 and UTF-16 are both character encodings for the Unicode character set. They represent a Unicode codepoint with one or more code units (which are then serialized as bytes). – Tom Blodget Nov 03 '18 at 19:48
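
A quick sketch of the surrogate-pair behaviour described in the comments, using only standard string methods:

```js
// An astral code point is stored as two UTF-16 code units (a surrogate pair).
const s = '\u{1F600}';
console.log(s.length);                      // 2 — two UTF-16 code units
console.log(s.charCodeAt(0).toString(16));  // 'd83d' (high surrogate)
console.log(s.charCodeAt(1).toString(16));  // 'de00' (low surrogate)
console.log(s.codePointAt(0).toString(16)); // '1f600' (the actual code point)
```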

0 Answers