A user can copy and paste into a textarea html input and sometimes is pasting invalid UTF-8 characters, for example, a copy and paste from a rtf file that contains tabs.
How can I check if a string is a valid UTF-8?
I think you misunderstand what "UTF-8 characters" means: UTF-8 is an encoding of Unicode that can represent any character defined in the (ever-growing) Unicode standard. There are more possible byte sequences than there are valid encodings of Unicode code points, so the only "invalid UTF-8 characters" are byte sequences that don't decode to any Unicode code point, but I assume this is not what you're referring to.
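If the question is taken literally at the byte level, a check could look something like the sketch below. It assumes the data is available as raw bytes (a Uint8Array) and that the runtime provides the standard TextDecoder API; the helper name isValidUtf8 is made up for illustration.

```javascript
// Hypothetical helper: true when `bytes` is a well-formed UTF-8 sequence.
function isValidUtf8(bytes) {
  try {
    // With fatal: true, decode() throws a TypeError on malformed input
    // instead of silently substituting U+FFFD.
    new TextDecoder('utf-8', { fatal: true }).decode(bytes);
    return true;
  } catch (e) {
    return false;
  }
}

const tabbedText = new Uint8Array([0x68, 0x69, 0x09]); // "hi" followed by a tab
const brokenPair = new Uint8Array([0xC3, 0x28]);       // lead byte without a continuation byte

console.log(isValidUtf8(tabbedText)); // true
console.log(isValidUtf8(brokenPair)); // false
```

A tab passes this check, which is why "invalid UTF-8" is probably not the real problem here.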
for example, a copy and paste from a rtf file that contains tabs.
RTF is a formatting system that works independently of the underlying encoding scheme: you can use RTF with ASCII, UTF-8, UTF-16 and other encodings. With respect to the HTML textboxes in your post, both the <input type="text"> and <textarea> elements in HTML only accept plaintext, so any RTF formatting is automatically stripped when a user pastes it. That is why JS-heavy "rich-edit" and contenteditable components are not uncommon in web applications (though in this answer I assume you're not using a rich-edit component in your page).
Tabs in RTF files are not an RTF feature: they're just normal ASCII-style tab characters, i.e. \t or 0x09, which also exist in Unicode and thus can also appear in UTF-8 encoded text. Furthermore, it's perfectly valid for web browsers to allow users to paste them into <input> and <textarea>.
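To make that concrete, here is a small sketch (assuming a runtime with the standard TextEncoder API) showing that a tab survives UTF-8 encoding as the single byte 0x09, exactly as in ASCII:

```javascript
// Encode a string containing a tab to UTF-8 and inspect the resulting bytes.
const encoded = new TextEncoder().encode('a\tb');

// Format each byte as two hex digits for readability.
const hex = Array.from(encoded, (b) => b.toString(16).padStart(2, '0'));

console.log(hex.join(' ')); // 61 09 62
```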
JavaScript (ECMAScript) itself is Unicode-native; that is, the ECMAScript specification does require JS engines to use UTF-16 representations in some places, such as in the abstract operation IsStringWellFormedUnicode:
7.2.9 Static Semantics: IsStringWellFormedUnicode
The abstract operation IsStringWellFormedUnicode takes argument string (a String) and returns a Boolean. It interprets string as a sequence of UTF-16 encoded code points, as described in 6.1.4, and determines whether it is a well formed UTF-16 sequence.
...but that part of the specification is intended for JS engine programmers, not for people who write JS for use in browsers. In fact, I'd say it's safe to assume that within a web browser, practically all JS string values are valid strings that can be serialized out to UTF-8 and UTF-16 (the one exception is a string containing an unpaired surrogate code unit, which has no well-formed UTF-8 encoding), and that JS scripts should not be concerned with the actual in-memory encoding of the string's content.
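One edge case worth knowing about: a JS string is a sequence of UTF-16 code units, so it can technically hold an unpaired surrogate, which cannot be encoded as well-formed UTF-8. A minimal sketch of a check follows; the name isWellFormedString is hypothetical, and it relies on encodeURIComponent throwing a URIError for lone surrogates. Newer engines also expose String.prototype.isWellFormed() for the same purpose.

```javascript
// Hypothetical helper: false when `str` contains an unpaired surrogate.
function isWellFormedString(str) {
  try {
    // encodeURIComponent() has to UTF-8 encode the string and throws a
    // URIError when it meets a lone surrogate code unit.
    encodeURIComponent(str);
    return true;
  } catch (e) {
    return false;
  }
}

console.log(isWellFormedString('tab\there'));    // true: a tab is an ordinary character
console.log(isWellFormedString('\uD83D\uDE00')); // true: a complete surrogate pair
console.log(isWellFormedString('\uD83D'));       // false: a high surrogate on its own
```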
So given that your question is written as this:
A user can copy and paste into a textarea html input and sometimes is pasting invalid UTF-8 characters, for example, a copy and paste from a rtf file that contains tabs.
How can I check if a string is a valid UTF-8?
I'm going to interpret it as this:
A user can copy RTF text from a program like WordPad and paste it into an HTML <textarea> or <input type="text"> in a web browser, and when it's pasted, the plaintext representation of the RTF still contains certain characters that my application should not accept, such as whitespace like tabs. How can I detect these unwanted characters and inform the user, or remove them?
...to which my answer is:
I suggest just stripping out unwanted characters using a regular expression that matches non-visible characters (from here: Match non printable/non ascii characters and remove from text):
let textBoxContent = document.getElementById( 'myTextarea' ).value;
textBoxContent = textBoxContent.replace( /[^\x20-\x7E]+/g, '' );
The expression [^\x20-\x7E] matches any character NOT in the code point range 0x20 (32, a normal space character ' ') to 0x7E (126, the tilde '~' character). All other characters will be removed, including tabs, line breaks, and non-Latin text.
The g flag at the end makes it a global find-and-replace operation; without the g, only the first run of unwanted characters would be removed.
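The difference is easy to see on a short sample string (the sample text is made up for illustration):

```javascript
// Three tabs in total: one after "a", then a run of two after "b".
const sample = 'a\tb\t\tc';

// Without g: only the first match (the single tab after "a") is removed.
const firstOnly = sample.replace(/[^\x20-\x7E]+/, '');

// With g: every run of unwanted characters is removed.
const allRuns = sample.replace(/[^\x20-\x7E]+/g, '');

console.log(JSON.stringify(firstOnly)); // "ab\t\tc"
console.log(JSON.stringify(allRuns));   // "abc"
```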
The range 0x20-0x7E works because Unicode's first 128 code points (U+0000 to U+007F) are identical to ASCII, which can be seen here: http://www.asciitable.com/
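Because the ASCII-only pattern also removes line breaks and non-Latin text, a gentler variant may be preferable in some applications. The sketch below is one possible variant, not something the linked answer prescribes: it uses the Unicode property escape \p{Cc} (control characters, which needs the u flag) and keeps line breaks. All three function names are hypothetical.

```javascript
// The pattern from above: keep only printable ASCII (0x20 to 0x7E).
function stripToPrintableAscii(text) {
  return text.replace(/[^\x20-\x7E]+/g, '');
}

// Variant: remove control characters (General_Category Cc, which includes
// tab) but keep \n and \r, and leave all other Unicode text untouched.
function stripControlChars(text) {
  return text.replace(/(?![\n\r])\p{Cc}/gu, '');
}

// Detection only, for informing the user instead of silently editing.
function hasControlChars(text) {
  return /(?![\n\r])\p{Cc}/u.test(text);
}

// "Café", a tab, "menu", a line break, then three Japanese characters.
const pasted = 'Caf\u00E9\tmenu\n\u65E5\u672C\u8A9E';

console.log(JSON.stringify(stripToPrintableAscii(pasted))); // "Cafmenu"
console.log(stripControlChars(pasted) === 'Caf\u00E9menu\n\u65E5\u672C\u8A9E'); // true
console.log(hasControlChars(pasted)); // true
```

Which variant fits depends on what the application is actually allowed to store; the ASCII-only version is the stricter of the two.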
Just an idea:
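function checkUTF8(text) {
  var utf8Text = text;
  try {
    // escape() percent-encodes characters below 0x100 as single %XX bytes
    // (higher ones become %uXXXX, which decodeURIComponent() rejects), and
    // decodeURIComponent() then decodes those bytes as UTF-8.
    utf8Text = decodeURIComponent(escape(text));
    // Success: text was plain ASCII or UTF-8 bytes stored one byte per
    // character, and utf8Text now holds the decoded string.
  } catch (e) {
    // console.log(e.message); // URI malformed
    // The content was not a valid UTF-8 byte sequence, so text is assumed
    // to be decoded already and is returned unchanged.
  }
  return utf8Text;
}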
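Note that escape() is deprecated. A rough equivalent of the same idea using the standard TextDecoder API might look like the sketch below; the function name decodeIfUtf8Bytes is made up, and it assumes the same input convention as above (a string whose characters each stand for one byte).

```javascript
// Hypothetical helper: if `text` looks like UTF-8 bytes stored one byte per
// character, return the decoded string; otherwise return `text` unchanged.
function decodeIfUtf8Bytes(text) {
  const bytes = new Uint8Array(text.length);
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    // A code unit above 0xFF cannot be a single byte, so the string is
    // treated as already decoded.
    if (code > 0xFF) return text;
    bytes[i] = code;
  }
  try {
    // fatal: true makes decode() throw on malformed UTF-8;
    // ignoreBOM: true keeps a leading byte order mark in the output.
    return new TextDecoder('utf-8', { fatal: true, ignoreBOM: true }).decode(bytes);
  } catch (e) {
    return text;
  }
}

console.log(decodeIfUtf8Bytes('\u00C3\u00A9') === '\u00E9'); // true: bytes C3 A9 decode to "é"
console.log(decodeIfUtf8Bytes('\u00E9') === '\u00E9');       // true: a lone E9 byte is not UTF-8, so unchanged
```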