2

Is there any way I can define the encoding in text areas using HTML and pure JS?
I want them to reject special Unicode characters (such as ♣♦♠).
The valid character range (for my purpose) is from Unicode code point U+0000 to U+00FF.
It is OK to silently replace invalid characters with an empty string upon form-submission (without warning to the user).

GitaarLAB
Heinzen
  • Please clarify 'special unicode characters' that you want to exclude OR specify the characters you want to accept. I ask because: if you would (for example) use the form's attribute `accept-charset`, you'd still be accepting characters (from that char-set) between 0x7F and 0xFF. Also this would/could have an impact on how the data will be submitted to your server. – GitaarLAB Oct 27 '14 at 13:01
  • That's pretty vague.. UTF-8 is an *encoding* (variable byte-length) of Unicode *charset* (of which you want to exclude 'special characters').. Please, specify your needs further: what character(s/range(s)) are allowed, which character(s/range(s)) are dis-allowed? – GitaarLAB Oct 27 '14 at 13:06
  • Ok, that clears up the required range. Now how do you want to handle exceptions? (a:) replace them with nothing (an empty string) while the user types in the textarea (which might make the textarea's cursor jump back to the textarea's first character-position) (b:) Warn the user while typing or on submission (c:) silently replace the illegal characters on submission (d:) popup a warning-screen informing the user of which characters are dis-allowed (and where) (and give them the opportunity to change them) etc etc etc, mix and match.. So, How do you want to handle the exceptions? – GitaarLAB Oct 27 '14 at 13:19
  • Replacing with empty strings on submission without warning. – Heinzen Oct 27 '14 at 13:24
  • I took the liberty of adding the relevant clarifications given in your comments to your question. – GitaarLAB Oct 27 '14 at 14:14

2 Answers

1

So, as you have clarified in your comments: you want to replace the characters you deem illegal with empty strings on form-submission without warning.

Given the following example html (body content):

<form action="demo_form.asp">
  First name: <input type="text" name="fname" /><br>
  Last name:  <input type="text" name="lname" /><br>
  Likes:      <textarea name="txt_a"></textarea><br>
  Dislikes:   <textarea name="txt_b"></textarea><br>
  <input type="submit" value="Submit">
</form>

Here is the basic concept in JavaScript:

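function demo(){
  // `this` is the form, because demo is assigned as its onsubmit handler
  var elms = this.getElementsByTagName('textarea');
  for (var i = 0; i < elms.length; i++) {
    // remove every character outside U+0000 to U+00FF
    elms[i].value = elms[i].value.replace(/[^\u0000-\u00FF]/g, '');
  }
}
window.onload = function(){
  document.forms[0].onsubmit = demo; // hook the form's onsubmit, use any method you like
};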

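The basic idea is to make the regex match on Unicode values (not on a local charset) by using the \uXXXX notation. JavaScript strings are sequences of UTF-16 code units, so a character beyond U+FFFF is stored as two surrogates; both of those fall outside the range and are removed as well.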
Then we simply make a range: [\u0000-\u00FF] and finally specify we want to match on everything outside that range: [^\u0000-\u00FF].
Everything that matches those criteria will be replaced by '' (an empty string) on form-submission. No warning no nothing.
You can (and should) freely expand this concept and incorporate it into your code in a way that fits your code flow, applying it to input type="text" etc. where needed, depending on your further requirements.

This should get you started!
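As one possible way to extend the concept above to text inputs as well, here is a minimal sketch. The names `stripOutsideLatin1` and `cleanFields` are hypothetical (only the regex comes from the answer), and the wiring assumes the form of interest is the first one on the page.

```javascript
// Matches every UTF-16 code unit outside U+0000 to U+00FF (regex from the answer).
var OUTSIDE_LATIN1 = /[^\u0000-\u00FF]/g;

// Hypothetical helper: returns the text with all disallowed characters removed.
function stripOutsideLatin1(text) {
  return String(text).replace(OUTSIDE_LATIN1, '');
}

// Hypothetical helper: cleans anything with a string `value` property
// (textarea elements, text inputs, or plain objects in a test).
function cleanFields(fields) {
  for (var i = 0; i < fields.length; i++) {
    fields[i].value = stripOutsideLatin1(fields[i].value);
  }
}

// Browser-only wiring; skipped where there is no DOM (for example in Node.js).
if (typeof document !== 'undefined') {
  document.addEventListener('DOMContentLoaded', function () {
    var form = document.forms[0]; // assumption: the first form is the one to clean
    if (!form) { return; }
    form.addEventListener('submit', function () {
      cleanFields(form.querySelectorAll('textarea, input[type="text"]'));
    });
  });
}
```

Note that the selector `input[type="text"]` does not match inputs that omit the `type` attribute, so adjust it to your markup. Keeping the string cleanup separate from the DOM wiring makes it easy to try the function on its own, e.g. `stripOutsideLatin1('a\u2663b')` returns `'ab'`.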

EDIT:
Note that your current valid-range specification (\u0000-\u00FF) will effectively dis-allow all such 'pesky' special characters like:

  • fancy quotes ‘ ’ “ ”
    (that's a great feature for people copying from Word etc.),
  • € ™ Œ œ, etc.

But it will nicely include the full C1 control block (all 32 control characters). On the other hand, that is at least consistent with including the full C0 control block.
Effectively, this is now your (what you requested) valid char-set: http://en.wikipedia.org/wiki/ISO/IEC_8859-1
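To make the consequences of the requested range concrete, the following sketch tests a few sample characters against it. The second pattern is a hypothetical stricter variant (not something the question asked for) that additionally rejects the C0 and C1 control characters, except tab, line feed and carriage return; all names here are illustrative.

```javascript
// The range requested in the question: anything outside U+0000 to U+00FF is disallowed.
var outsideRequested = /[^\u0000-\u00FF]/;

// Hypothetical stricter variant: also disallows control characters
// (C0, DEL and C1), keeping only tab, LF, CR and the printable ranges.
var outsideStricter = /[^\t\n\r\u0020-\u007E\u00A0-\u00FF]/;

// True when the single character `ch` is NOT matched by the "disallowed" pattern.
function isAllowed(ch, disallowedPattern) {
  return !disallowedPattern.test(ch);
}

isAllowed('\u00E9', outsideRequested); // true:  é is inside the range
isAllowed('\u20AC', outsideRequested); // false: € (U+20AC) is outside
isAllowed('\u2019', outsideRequested); // false: ’ (fancy quote) is outside
isAllowed('\u0085', outsideRequested); // true:  a C1 control character is accepted
isAllowed('\u0085', outsideStricter);  // false: the stricter variant rejects it
```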

As you can now see, there is a lot more to this. That is why sane applications (finally) are starting to use Unicode (usually encoded for the web as UTF-8) and just accept what the users provide (within (extremely clearly specified) reason)!
Most common validation-questions are (in the real world) nothing more than a high-school-class example of the concept of validating (and even more to the point: a way to explain the basics of regular expressions with what are considered to be easily understandable examples, like name/email/address). Sadly they are widely applied even by some government identity-systems (up to passports etc) to people's names, addresses etc. In fact: even the full current Unicode cannot represent every person's name (in native writing) on the planet (that is actually still alive)!! Real world example: try entering and leaving a commercial flight when your boarding-pass has different credentials than your passport (regardless of which one is wrong).. 'Just' an umlaut missing is going to be a problem somewhere. Worse example: imagine a woman with a German first name, Thai last name and married to a man with a Mandarin last name..
Source: xkcd.com/1171/

Finally: Please do realize that in most cases this whole exercise is useless (if you do it silently without warning), because:
you may never just accept user-input on the server-side without proper cleanup, so you are already (silently, without the user knowing it) cleaning up your input into the form that you require. To a novice programmer (who forgets to think about, for example, users with javascript disabled) this sometimes feels like repeating the work already done in javascript on the client-side...
Usually, the only use of replicating the server-side behavior on the client-side (usually using javascript) is so the user dynamically knows what would be dis-allowed by the server (without sending data back and forth) and can adapt accordingly!
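If you do want that kind of client-side feedback, a minimal sketch could report which characters would be rejected and where, instead of removing them silently. The function name `findDisallowed` is hypothetical; the first alternative in the pattern is an addition of mine so that a character beyond U+FFFF is reported once rather than as two separate surrogate halves.

```javascript
// Hypothetical helper: lists every character outside U+0000 to U+00FF
// together with its position (a UTF-16 code unit index) in the text.
function findDisallowed(text) {
  // First alternative: a complete surrogate pair, reported as one character.
  // Second alternative: any single code unit outside the allowed range.
  var pattern = /[\uD800-\uDBFF][\uDC00-\uDFFF]|[^\u0000-\u00FF]/g;
  var hits = [];
  var match;
  while ((match = pattern.exec(text)) !== null) {
    hits.push({ index: match.index, char: match[0] });
  }
  return hits;
}

// Usage: an empty result means the server-side rule would accept the text.
var problems = findDisallowed('a\u2663b');
// problems is [{ index: 1, char: '♣' }]
```

How the hits are shown to the user (inline message, highlight, blocked submit) is up to your application; the server-side cleanup remains necessary either way.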

GitaarLAB
  • Thanks for the solution. Not only does it work, you have also brought up some really useful info. A+ reply! As for the silent removal, there aren't any serious problems, because the people accessing it know what is and isn't allowed; it is just to prevent unwanted people from sending broken data. Thank you once again! – Heinzen Oct 27 '14 at 15:24
  • Thank you, you're very welcome! Thank you for posting a clearly defined, yet broadly applicable question (as that is what made it possible to add the more important and relevant information (in general) to my answer)! – GitaarLAB Oct 27 '14 at 15:33
0

You can use form attribute accept-charset

The accept-charset attribute specifies the character encodings that are to be used for the form submission.

In HTML 4, the default value is the reserved string "UNKNOWN" (which indicates that the encoding equals the encoding of the document containing the element).

See this documentation http://www.w3schools.com/tags/att_form_accept_charset.asp

I cannot say if this will protect the text field but at least it controls what character set is submitted by the form.

Actually this issue has already been answered in: "javascript to prevent writing into form elements after n utf 8 characters"

Jack Shultz
  • I have tried using the form method and it does not solve my problem. I am not sure whether I did it wrong but I was still getting unicode characters being sent to my backend. – Heinzen Oct 27 '14 at 13:01
  • I just thought of another solution, maybe you could put a listener on the fields. Then do a regular expression search for UTF-8 characters, and then remove them? – Jack Shultz Oct 27 '14 at 13:03
  • UTF-8 characters??? Yes, we'd do a regular expression something on the input, but to do that, we need to know the valid and invalid ranges the op requires. – GitaarLAB Oct 27 '14 at 13:08
  • I am going to check that reference, thanks! Gitaar, check my comment on OP – Heinzen Oct 27 '14 at 13:12
  • I misunderstood what that post does. It's actually just counting the length, sorry. – Jack Shultz Oct 27 '14 at 13:14
  • So I think you need to do a search for non-ascii characters like this `str.replace(/[^\x00-\x7F]/g, "");` – Jack Shultz Oct 27 '14 at 13:18