In my application, clients upload data from MS Word into a textarea. My RegEx skills are not so good :)

I need a RegEx that filters all junk characters out of a string, so that the only acceptable input is characters from the keyboard, i.e. A-Z, a-z, 0-9, all the special characters present on the keyboard, and all currency symbols.

EDIT: I want to allow only ASCII codes, including extended ASCII. http://www.asciitable.com/

NoobDeveloper

1 Answer

I have checked the ASCII table and all printable symbols it contains are present on any standard keyboard.

It's hard to tell what defines "special characters present on the keyboard", but I assume you mean printable non-alphanumeric characters. While all the Unicode whitespace characters (non-breaking space, zero-width non-joiner...) are indeed "special", they are absent from most keyboards. The backspace character, while present on most keyboards, is typically interpreted by the OS, so I assume you don't want that. A similar argument applies to the tab key: while the tab character is easier to obtain than the newline character, it can't normally be typed into a form input.

Concerning currency symbols, the Unicode category \p{Sc} covers them, and the C# (.NET) regex engine supports this category.
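As a quick illustration of \p{Sc}, here is a minimal sketch. The class and method names are made up for this example, and it only tests whether a string contains any character in the currency-symbol category:

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
static class CurrencyCheck
{
    // True when the string contains at least one character
    // in the Unicode general category Sc (Symbol, currency).
    public static bool ContainsCurrencySymbol(string text)
    {
        return Regex.IsMatch(text, @"\p{Sc}");
    }
}
```

With this, "$", "€5" and "£" should all be reported as containing a currency symbol, while "5 EUR" should not, because it only contains letters, a digit and a space.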

Non-US keyboards contain many more characters (symbols with diacritics, Cyrillic, Chinese/Japanese/Korean characters), but they don't match your description of "A-Z, a-z, 0-9 and all the special characters present on keyboard + all currency symbols". Of special interest is the Japanese end-of-sentence punctuation, which is a hollow circle instead of just a dot. However, while it arguably matches your description, I believe you don't want that either.

C# also supports the named block \p{IsBasicLatin} (block names are case-sensitive in .NET), but that includes the ASCII control characters, which I assume you don't want.

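To sum up: your description matches the entire printable ASCII range, the newline \n, and the currency symbols. To check that a string is made up only of these, use this regex (the * quantifier is what makes the class apply to the whole string rather than to a single character):

^[\x20-\x7E\n\p{Sc}]*$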

Reflecting your edit, also consider all printable ASCII characters plus the newline. Note that $ is the only currency symbol in ASCII, so most currency symbols are rejected by this one:

^[\x20-\x7E\n]*$

or the entire ASCII range, including the control characters and all ASCII whitespace (the two forms below are equivalent):

^[\x00-\x7F]*$
^[\p{IsBasicLatin}]*$
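To make the difference between the printable-only check and the full-ASCII check concrete, here is a small sketch. The class and method names are invented for the example, and both patterns use a * quantifier so that they validate the whole string:

```csharp
using System.Text.RegularExpressions;

// Hypothetical helpers, not part of the original answer.
static class AsciiChecks
{
    // Printable ASCII (space through tilde) plus the line feed.
    public static bool IsPrintableAsciiOrNewline(string text)
    {
        return Regex.IsMatch(text, @"^[\x20-\x7E\n]*$");
    }

    // Any 7-bit ASCII character, control characters included.
    public static bool IsAscii(string text)
    {
        return Regex.IsMatch(text, @"^[\x00-\x7F]*$");
    }
}
```

A string containing a tab, or a Windows-style \r\n line break, passes IsAscii but fails IsPrintableAsciiOrNewline. If the textarea content may arrive with \r\n line breaks, you would probably want to add \r to the first character class as well.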

Ref:
MSDN character classes
MSDN character escapes
MSDN code example (adapted here):

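// Requires: using System.Text.RegularExpressions;
bool IsValid(string strIn)
{
    // Return true if strIn contains only printable ASCII characters,
    // newlines and currency symbols (an empty string also passes).
    return Regex.IsMatch(strIn, @"^[\x20-\x7E\n\p{Sc}]*$");
}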

Regex replace (adapted here; strips out everything except A-Z, a-z, 0-9 and the following characters: ~ ` ! @ # $ % ^ & * ( ) _ + | - = \ { } [ ] : " ; ' < > ? , . /)

string CleanInput(string strIn)
{
    // Replace invalid characters with empty strings.
    // Inside a verbatim string literal, "" stands for a single double quote.
    return Regex.Replace(strIn,
          @"[^a-zA-Z0-9~`!@#$%^&*()_+|\-=\\{}\[\]:"";'<>?,./]", "");
}

Concerning double quotes inside verbatim string literals: http://blogs.msdn.com/b/gusperez/archive/2005/08/10/450257.aspx

John Dvorak
  • Thanks @Jan Dvorak Can you give me RegEx which will strip out everything except A-Z, a-z , 0-9 and following characters. ~ ` ! @ # $ % ^ & * ( ) _ + | - = \ { } [ ] : " ; ' < > ? , . / – NoobDeveloper Feb 13 '13 at 14:43
  • @Nexus I've added an example how to regex-replace in C# – John Dvorak Feb 13 '13 at 14:56
  • Thanks once again. How do I allow the double & single quotes from MS Word in the string? “ ‘ – NoobDeveloper Feb 14 '13 at 09:53
  • @Nexus sorry buddy, I have no idea how to recognize which single and double quotes come from MS word. Allowing _all_ single and double quotes should be easy (note that they should already be allowed.) – John Dvorak Feb 14 '13 at 10:59
  • The US keyboard's tilde and space characters are missing from the example. Here is the complete list: Regex.Replace(strIn, @"[^a-zA-Z0-9~`!@#$%^&*()_+|\-=\\{}\[\]:"";'<>?,./ ]", "") – Jeson Martajaya Aug 14 '15 at 19:26
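
On the MS Word quotes question in the comments: Word's autocorrect typically replaces straight quotes with the curly quotes U+2018, U+2019, U+201C and U+201D, which are outside ASCII and therefore get stripped by an ASCII-only filter. One possible approach, sketched here under the assumption that converting them to straight quotes is acceptable, is to normalize them before filtering. The class and method names are hypothetical:

```csharp
// Hypothetical helper, not part of the original answer.
static class QuoteNormalizer
{
    // Map the four common curly quotes to their ASCII equivalents.
    public static string NormalizeSmartQuotes(string text)
    {
        return text
            .Replace('\u2018', '\'')  // left single quotation mark
            .Replace('\u2019', '\'')  // right single quotation mark
            .Replace('\u201C', '"')   // left double quotation mark
            .Replace('\u201D', '"');  // right double quotation mark
    }
}
```

Running the input through NormalizeSmartQuotes first and then through the replace-based filter should keep the quotes, as plain ' and ", instead of dropping them. If the curly quotes must survive unchanged, the alternative would be to add those four code points to the allowed character class.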