Assume I have a form element that should allow pretty much any reasonable string that names something (i.e. something like the title of this question).

How do I validate that the string is reasonable, and not something weird or unsafe? (Assume here that Unicode emoticons such as ☺ are reasonable.)

Checking for escape characters such as newlines and form feeds is of course a given. Things like length are harder, though: a descriptive English name is very hard to write in just 1 character, but trivial in Chinese.

There are ~31 Unicode classes; which ones are safe?

What would a complete regex or similar check look like in JavaScript or C#?

Cine
  • Just for context, what is it that you are trying to make it safe for? Is it safety for insertion into a database, or for XSS, or unvalidated user input? – gmiley Nov 30 '16 at 02:56
  • All of the above. – Cine Nov 30 '16 at 03:02
  • Well, as long as you use parameterized queries for your SQL commands, any input will be safe against SQL injection. At that point you just need to specify in your database that the column you are storing data in is Unicode. When you display any of the content, you will want to ensure that you use HTML encoding procedures, which are available in JavaScript as well as the majority of server-side scripting languages. – gmiley Nov 30 '16 at 03:15
  • `[A-Z]` is the only option that is somewhat safe for all possible and plausible places to send such data. Even lowercase letters may break code using incorrect UTF-7 encoding... Digits can definitely be mistakenly treated as numbers by bad code... {end of trolling} - you need to define what you are trying to achieve much better than "safe" for the question to be answerable. – Alexei Levenkov Nov 30 '16 at 03:39
  • @AlexeiLevenkov an input like the title of this question – Cine Nov 30 '16 at 04:07

1 Answer

How do I validate that the string is reasonable, and not something weird or unsafe?

It's not clear what you mean by ‘unsafe’. As @gmiley said, you can't protect against injection issues like XSS by filtering input; this is an output escaping issue.
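To make the output-escaping point concrete, here is a minimal sketch in JavaScript. The helper name `escapeHtml` is hypothetical, and it assumes the value is being written into HTML element content or a quoted attribute; other contexts (URLs, inline scripts, CSS) need their own encoding, and a templating library that escapes by default is usually the safer choice.

```javascript
// Hypothetical helper (not from any library): escape a string for use in
// HTML element content or inside a quoted attribute value.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;") // must run first so later entities are not double-escaped
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}
```

The stored name stays exactly as the user typed it; escaping happens only at the point where it is written into a page.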

As for ‘reasonable’, a good starting place would be:

  • As you mentioned, disallowing control characters: U+0000–U+001F and U+007F–U+009F, minus newline and maybe tab if you want to allow those.

  • Especially for web applications, disallowing characters that are ‘unsuitable for use in markup’ according to the unicode-xml note. This prevents layout tricks like the Right-to-Left-Override.

  • Unicode normalisation (String.Normalize in C#), for example Normal Form C to standardise the code points for combining accents, or Normal Form KC to also flatten oddities like fullwidth text, which you might or might not want to do depending on audience.

  • If you don't like Zͪa̻͍l̀g̐ͦ͢oͬ̓ṯ̺ͮěͧ̚͞x͕̀̇ṱ̢͖̩̮̆̃ͤ you might like to consider limiting consecutive combiners.
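The checks above could be combined roughly as follows in JavaScript. This is only a sketch under stated assumptions: the function name `validateName`, the 200-code-point limit, and the limit of 3 consecutive combining marks are arbitrary choices of mine; the name is assumed to be single-line, so tab and newline are rejected too; and the ‘unsuitable for markup’ set is a partial subset covering only BMP characters, so check it against the unicode-xml note before relying on it. It needs an engine that supports `\p{…}` property escapes in regular expressions.

```javascript
// Control characters: C0 (U+0000-U+001F), DEL, and C1 (U+007F-U+009F).
// Assumption: names are single-line, so tab and newline are rejected as well.
const CONTROL_CHARS = /[\u0000-\u001F\u007F-\u009F]/;

// Partial subset of characters considered unsuitable for markup:
// line/paragraph separators, bidi embeddings and overrides, deprecated
// format characters, BOM, interlinear annotation, object replacement.
const MARKUP_UNSUITABLE = /[\u2028\u2029\u202A-\u202E\u206A-\u206F\uFEFF\uFFF9-\uFFFC]/;

// Assumed limit on consecutive combining marks (general category M).
const MAX_COMBINING_RUN = 3;
const TOO_MANY_COMBINERS = new RegExp(`\\p{M}{${MAX_COMBINING_RUN + 1},}`, "u");

// Hypothetical validator: returns { ok, value } or { ok, reason }.
function validateName(input, maxCodePoints = 200) {
  // Normal Form C composes base letter + accent pairs where a composed form exists.
  const normalized = String(input).normalize("NFC");

  if (CONTROL_CHARS.test(normalized)) {
    return { ok: false, reason: "control character" };
  }
  if (MARKUP_UNSUITABLE.test(normalized)) {
    return { ok: false, reason: "character unsuitable for markup" };
  }
  if (TOO_MANY_COMBINERS.test(normalized)) {
    return { ok: false, reason: "too many consecutive combining marks" };
  }

  const value = normalized.trim();
  if (value.length === 0) {
    return { ok: false, reason: "empty" };
  }
  // Count code points rather than UTF-16 units so astral characters count once.
  if ([...value].length > maxCodePoints) {
    return { ok: false, reason: "too long" };
  }
  return { ok: true, value };
}
```

I have deliberately left out a minimum length, since as the question notes a single character can be a perfectly good name in some scripts. The same steps translate to C# with `String.Normalize` and `Regex` using `\p{M}`.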

bobince