I'm working on a vulnerability scanner for webapps and I ran into a problem I can't seem to solve. Webapps usually use UTF-8 encoding, which uses 1-4 bytes per character. For example, a 4-byte character in UTF-8 starts with a byte of the form "11110xxx" followed by 3 continuation bytes of the form "10xxxxxx".
Reading more about UTF-8, I found that the original specification also allowed 5- and 6-byte sequences: if the leading byte matches "111110xx" it's a 5-byte sequence, and if it matches "1111110x" it's a 6-byte sequence. (Newer revisions of the standard removed these, which is presumably why decoders reject them.)
I want to inject such sequences into webapps (via my scanner) and see if they break. I tried the utf8.js library (found on npm) to create them, but it turns out this library only supports sequences up to 4 bytes per character.
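Based on the bit patterns above, I started sketching a manual encoder myself. This builds the raw bytes directly with a `Uint8Array` (the function name and approach are just my own attempt, not from any library), but I'm not sure it's the right way to go:

```javascript
// Manually encode a code point into a 5- or 6-byte UTF-8-style sequence.
// These forms are rejected by modern decoders, which is exactly what I
// want to test. byteCount must be 5 or 6.
function encodeLongSequence(codePoint, byteCount) {
  // Leading-byte templates: 111110xx (5 bytes) and 1111110x (6 bytes).
  const leading = { 5: 0xF8, 6: 0xFC }[byteCount];
  if (leading === undefined) throw new RangeError("byteCount must be 5 or 6");

  const bytes = new Uint8Array(byteCount);
  // Fill continuation bytes (10xxxxxx) from the low bits upward.
  for (let i = byteCount - 1; i > 0; i--) {
    bytes[i] = 0x80 | (codePoint & 0x3F);
    codePoint >>>= 6;
  }
  // Remaining high bits go into the leading byte.
  bytes[0] = leading | codePoint;
  return bytes;
}

// Overlong 6-byte encoding of "A" (U+0041):
// FC 80 80 80 81 81
const payload = encodeLongSequence(0x41, 6);
```

The idea would then be to send `payload` as the raw request body (e.g. via `fetch` with a `Uint8Array` body) rather than going through a string, since JavaScript strings can't represent these invalid sequences directly.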
How do I programmatically create a character that uses 5 or 6 bytes in JavaScript?