14

A user can copy and paste into a textarea html input and sometimes is pasting invalid UTF-8 characters, for example, a copy and paste from a rtf file that contains tabs.

How can I check if a string is a valid UTF-8?

pavlos163
Shoebie
  • 1
    This may help you: http://stackoverflow.com/questions/20639052/check-if-the-bytes-sequence-is-valid-utf-8-sequence-in-javascript – Hadi J Mar 30 '16 at 17:02
  • Looks similar to [Validating user's UTF-8 name in Javascript](http://stackoverflow.com/questions/6381752/validating-users-utf-8-name-in-javascript) – Abhijit Mar 30 '16 at 17:03

2 Answers

6

Exposition

I think you misunderstand what "UTF-8 characters" means: UTF-8 is an encoding of Unicode which can represent any character defined in the (ever-growing) Unicode standard. There are more possible UTF-8 byte sequences than there are Unicode code points, so the only "invalid UTF-8 characters" are byte sequences that don't map to any Unicode code point, but I assume this is not what you're referring to.

for example, a copy and paste from a rtf file that contains tabs.

RTF is a formatting system which works independently of the underlying encoding scheme: you can use RTF with ASCII, UTF-8, UTF-16 and other encodings. With respect to the HTML textboxes in your post, both the <input type="text"> and <textarea> elements in HTML only accept plaintext, so any RTF formatting will be automatically stripped when pasted by a user. That is why JS-heavy "rich-edit" and contenteditable components are not uncommon in web-applications (though in this answer I assume you're not using a rich-edit component in a web-page).

Tabs in RTF files are not an RTF feature: they're just normal ASCII-style tab characters, i.e. \t or 0x09, which also appear in Unicode, and thus, can also appear in UTF-8 encoded text; furthermore, it's perfectly valid for web-browsers to allow users to paste those into <input> and <textarea>.


JavaScript (ECMAScript) itself is Unicode-native; that is, the ECMAScript specification requires JS engines to treat strings as UTF-16 in some places, such as in the abstract operation IsStringWellFormedUnicode:

7.2.9 Static Semantics: IsStringWellFormedUnicode

The abstract operation IsStringWellFormedUnicode takes argument string (a String) and returns a Boolean. It interprets string as a sequence of UTF-16 encoded code points, as described in 6.1.4, and determines whether it is a well formed UTF-16 sequence.

...but that part of the specification is intended for JS engine programmers, and not people who write JS for use in browsers. In fact, I'd say it's safe to assume that within a web-browser, practically all JS string values will be valid strings that can be serialized out to UTF-8 and UTF-16 (the one exception is a string containing a lone surrogate code unit, which has no well-formed UTF-8 encoding), and also that JS scripts should not be concerned with the actual in-memory encoding of the string's content.
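If byte-level validity really is the concern, here is a minimal sketch of how it could be checked. It assumes an environment that provides the standard `TextDecoder` (modern browsers and Node.js). The helper names `isValidUtf8` and `isWellFormedString` are made up for illustration. The first checks raw bytes; the second checks an already-decoded JS string for lone surrogates, which are the only thing that stops a JS string from being encoded as well-formed UTF-8.

```javascript
// Hypothetical helper: true when `bytes` (a Uint8Array) is well-formed UTF-8.
function isValidUtf8(bytes) {
    try {
        // With fatal: true, decode() throws a TypeError on malformed input
        // instead of silently substituting U+FFFD.
        new TextDecoder('utf-8', { fatal: true }).decode(bytes);
        return true;
    } catch (e) {
        return false;
    }
}

// Hypothetical helper: true when `text` contains no lone surrogate code units.
function isWellFormedString(text) {
    for (let i = 0; i < text.length; i++) {
        const unit = text.charCodeAt(i);
        if (unit >= 0xD800 && unit <= 0xDBFF) {
            // A high surrogate must be followed by a low surrogate.
            const next = text.charCodeAt(i + 1);
            if (!(next >= 0xDC00 && next <= 0xDFFF)) return false;
            i++; // skip the low surrogate we just validated
        } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
            // A low surrogate with no preceding high surrogate.
            return false;
        }
    }
    return true;
}
```

Note that a tab (byte 0x09) passes both checks: it is an ordinary character, not an encoding error.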

Your question

So given that your question is written as this:

A user can copy and paste into a textarea html input and sometimes is pasting invalid UTF-8 characters, for example, a copy and paste from a rtf file that contains tabs.

How can I check if a string is a valid UTF-8?

I'm going to interpret it as this:

A user can copy RTF text from a program like WordPad and paste it into an HTML <textarea> or <input type="text"> in a web-browser, and when it's pasted the plaintext representation of the RTF still contains certain characters that my application should not accept, such as tabs and other whitespace.

How can I detect these unwanted characters and inform the user - or remove those unwanted characters?

...to which my answer is:

I suggest just stripping out unwanted characters using a regular expression that matches everything outside the printable ASCII range (from here: "Match non printable/non ascii characters and remove from text").

let textBoxContent = document.getElementById( 'myTextarea' ).value;
textBoxContent = textBoxContent.replace( /[^\x20-\x7E]+/g, '' );
  • The expression [^\x20-\x7E] matches any character NOT in the code point range 0x20 (32, a normal space character ' ') to 0x7E (126, the tilde '~' character). Every matched character is removed, including tabs, newlines, and non-Latin text.

  • The g switch at the end makes it a global find-and-replace operation; without the g then only the first unwanted character would be removed.

  • The range 0x20-0x7E works because Unicode's first 128 code points (0x00-0x7F) are identical to ASCII and can be seen here: http://www.asciitable.com/
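If removing non-Latin text is not acceptable, one possible variation is to target only control characters. The sketch below assumes a JS engine that supports Unicode property escapes (`\p{...}` with the `u` flag) and `String.prototype.matchAll`. The names `UNWANTED`, `stripControlChars`, and `findControlChars` are hypothetical, and keeping `\n` and `\r` is an assumption about what a textarea should allow.

```javascript
// \p{Cc} is the Unicode "Control" category (tab, NUL, DEL, C1 controls, ...).
// The negative lookahead keeps line breaks; that choice is an assumption.
const UNWANTED = /(?![\n\r])\p{Cc}/gu;

// Remove the unwanted characters, leaving all other text (any script) intact.
function stripControlChars(text) {
    return text.replace(UNWANTED, '');
}

// Report where the unwanted characters are, e.g. to show the user a warning.
function findControlChars(text) {
    return Array.from(text.matchAll(UNWANTED), (match) => ({
        index: match.index,
        codePoint: match[0].codePointAt(0),
    }));
}
```

For example, `findControlChars(textarea.value)` could drive a validation message, while `stripControlChars` could be applied before submitting the form.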

Mike 'Pomax' Kamermans
Dai
  • 9
    To correct some misconceptions in this answer, too: there is no such thing as UTF8 "characters"; as an encoding scheme there are "UTF8 byte sequences", encoding Unicode code points, and these byte sequences can *absolutely* suffer from illegal values in the byte sequence. Similarly, Unicode as the formal mapping of "orthographic constructs" to numerical codes *also* has certain numbers that may not be used. Encountering a UTF8 byte stream with an illegal byte sequence, or a decoded Unicode sequence containing illegal numbers, is entirely possible, so: yes, there are "invalid UTF-8 characters". – Mike 'Pomax' Kamermans Apr 14 '16 at 00:47
  • @Mike'Pomax'Kamermans I've rewritten my answer to implement your feedback; thank you for the input. – Dai Feb 18 '23 at 17:29
  • I've further edited your text because that's not a technical detail if the whole point of the paragraph is to explain that the answer is "yes", but that the question the answer is "yes" to isn't what they wanted to know. – Mike 'Pomax' Kamermans Feb 18 '23 at 17:47
  • @Mike'Pomax'Kamermans – Dai Feb 18 '23 at 17:47
2

Just an idea:

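function checkUTF8(text) {
    var utf8Text = text;
    try {
        // Treat each character as one byte and try to decode those bytes as UTF-8.
        // Note: escape() is deprecated.
        utf8Text = decodeURIComponent(escape(text));
        // Success: text was plain ASCII or held raw UTF-8 bytes, which are now decoded
    } catch (e) {
        // console.log(e.message); // URI malformed
        // Failure: text was not a UTF-8 byte sequence, so it is kept unchanged
    }
    return utf8Text; // the decoded text, or the original if decoding failed
}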
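Since `escape` is deprecated, the same idea can be sketched with the standard `TextDecoder` instead. This is only an illustration under the assumption that `TextDecoder` is available (modern browsers and Node.js); the name `repairMisdecodedUtf8` is hypothetical.

```javascript
// Hypothetical helper: if `text` holds UTF-8 bytes stored one per character
// (for example "Ã©" instead of "é"), return the properly decoded string;
// otherwise return `text` unchanged.
function repairMisdecodedUtf8(text) {
    const bytes = new Uint8Array(text.length);
    for (let i = 0; i < text.length; i++) {
        const code = text.charCodeAt(i);
        // A character above 0xFF cannot be a single byte, so this is real text.
        if (code > 0xFF) return text;
        bytes[i] = code;
    }
    try {
        // fatal: true makes decode() throw on malformed UTF-8.
        return new TextDecoder('utf-8', { fatal: true }).decode(bytes);
    } catch (e) {
        // Not a valid UTF-8 byte sequence: leave the input as it was.
        return text;
    }
}
```

As with the `escape` version, plain ASCII input comes back unchanged, so the function cannot tell "already fine" apart from "nothing to repair".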
  • 4
    `escape` is deprecated and should not be used (because it can't handle Unicode properly) – Quentin Jan 04 '18 at 12:37
  • What does "text is not utf-8" mean? It seems this means text is ASCII? and in the catch it is unicode? – xeruf Mar 01 '23 at 09:23