16

I have a site where users can pick a username. Currently, they can put in almost any characters including things such as @ ! # etc.

I know I can use a regex, and that's probably what I'm opting for.

I'll be using a negated set, which I'm assuming is the right tool here, like so:

[^@!#]

So, how can I know all of the illegal characters to put in that set? I can start manually putting in the ones that are obvious such as !@#$%^&*(), but is there an easy way to do this without manually putting every single one of them in?

I know a lot of sites only allow usernames that contain letters, numbers, dashes, or underscores. Something like that would work well for me.

Any help would be greatly appreciated.

Thanks S.O.!

Isaiah Lee
  • 2
    If you know what you want to include (alphanumeric + hyphens + underscores) why are you using a negated set? – univerio Jun 25 '14 at 21:42

4 Answers

31

Instead of using negation, place only what you want to allow inside of your character class.

^[a-zA-Z0-9_-]*$

Explanation:

^                 # the beginning of the string
 [a-zA-Z0-9_-]*   #  any character of: 'a' to 'z', 'A' to 'Z', 
                  #  '0' to '9', '_', '-' (0 or more times)
$                 # before an optional \n, and the end of the string
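A minimal Python sketch of this allow-list check. The helper name `is_valid_username` is hypothetical, and requiring at least one character (`+` instead of `*`) is an assumption on my part: the pattern above, with `*`, also accepts an empty string. `re.fullmatch` anchors both ends, which also sidesteps `$` matching before a trailing newline.

```python
import re

# Same character class as the answer; "+" (one or more) is an assumption,
# since an empty username is probably not wanted.
USERNAME_RE = re.compile(r"[a-zA-Z0-9_-]+")


def is_valid_username(name: str) -> bool:
    """Return True if every character of name is in the allowed set."""
    # fullmatch must consume the whole string, so no ^ or $ is needed
    # and a trailing "\n" is rejected.
    return USERNAME_RE.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["isaiah_lee-9", "bad@name", "", "name\n"]:
        print(repr(candidate), is_valid_username(candidate))
```

A length limit, if the site needs one, could be expressed with a quantifier such as `{3,30}` in place of `+`; those bounds are only an illustration.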
hwnd
  • thanks for the REGEX. It definitely makes more sense to only write an inclusive set for this scenario! :) – Isaiah Lee Jun 25 '14 at 21:55
  • 1
    @hwnd - looks good, but I'm wondering why the '/' character is not also considered a safe character in a URL; it seems fundamental in paths – Reinsbrain Feb 19 '16 at 18:41
  • @Reinsbrain The question states that the regex is checking a username. My guess is the username is being used as part of the URL, such as www.example.com/[username]/settings. Allowing '/' makes sense when checking for a full valid URL, but not when checking whether a string can be used as part of a URL. To understand why, imagine a user with a username ending or starting with a '/'. – Gino Sep 14 '20 at 04:27
  • 3
    RFC 3986 lists the unreserved characters as ALPHA / DIGIT / "-" / "." / "_" / "~", so for completeness this should be: `^[a-zA-Z0-9._~-]*$` – Ben Golding Jan 06 '22 at 00:20
  • 1
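    To add to this great answer (and comments), `\w` (word character) is a nice shorthand for `[a-zA-Z0-9_]` in engines or modes where `\w` is ASCII-only (some, such as Python 3 by default, also match Unicode letters and digits), so I now use: `^[\w.~-]*$`. Thanks all for your help! – Joel Balmer Mar 28 '22 at 14:15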
3

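Instead of denying values, it may be better to allow only a specific set:

[[:word:]] -- digits, letters and underscore (a POSIX-style class supported by some engines, such as PCRE; it must appear inside a bracket expression)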

Check this chart

http://www.cheatography.com/davechild/cheat-sheets/regular-expressions/

brunofitas
2

One of the reasons you'll want to use an inclusive set is that excluding bad characters is very difficult with all the Unicode variants out there. Characters such as ß, ñ, œ, æ will probably give you a headache. If you limit the username to just a subset of letters that YOU provide, you can easily chop out everything else that you may not want in there.
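To make the difference concrete, here is a small Python sketch under my own assumptions: the names `DENY_RE`, `ALLOW_RE` and `strip_disallowed` are hypothetical, and the allowed set is the letters, digits, underscore and hyphen mentioned in the question. It shows that a short negated set lets non-ASCII letters through, while an inclusive set rejects them and can also be used to strip them.

```python
import re

# The question's negated set: anything except @ ! # is accepted.
DENY_RE = re.compile(r"[^@!#]+")

# An explicit inclusive set: only these characters are accepted.
ALLOW_RE = re.compile(r"[a-zA-Z0-9_-]+")


def strip_disallowed(name: str) -> str:
    """Remove every character that is not in the inclusive set."""
    return re.sub(r"[^a-zA-Z0-9_-]", "", name)


if __name__ == "__main__":
    sample = "straße"
    # The negated set accepts the whole string, including "ß".
    print(DENY_RE.fullmatch(sample) is not None)
    # The inclusive set rejects it.
    print(ALLOW_RE.fullmatch(sample) is not None)
    # Stripping keeps only the allowed characters.
    print(strip_disallowed("mañana!"))
```

Whether to reject such input or silently strip it is a design choice; rejecting with a clear error message is usually less surprising for the user.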

OnlineCop
1

All the answers to this question seem to assume the English language. To allow Unicode characters (so people can have URLs / usernames in their native language), it is better to use a blacklist of reserved / unsafe characters rather than a whitelist of allowed characters.

Here is a regex that matches characters which are generally unsafe in a URL:

([&$\+,:;=\?@#\s<>\[\]\{\}\/|\\\^%])+

(list based on unsafe characters mentioned in this answer)
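A hedged Python sketch of this deny-list approach. The characters are the ones listed in this answer, written as a single character class; the names `UNSAFE_RE` and `has_unsafe_char` are hypothetical. Note that anything not in the list, such as `!` or `*`, still passes, so this only illustrates the technique and is not a complete safety check.

```python
import re

# One character class holding the answer's unsafe characters:
# & $ + , : ; = ? @ # whitespace < > [ ] { } / | \ ^ %
UNSAFE_RE = re.compile(r"[&$+,:;=?@#\s<>\[\]{}/|\\^%]")


def has_unsafe_char(name: str) -> bool:
    """Return True if name contains at least one listed unsafe character."""
    return UNSAFE_RE.search(name) is not None


if __name__ == "__main__":
    for candidate in ["josé", "名前", "a/b", "two words", "50%off"]:
        print(repr(candidate), has_unsafe_char(candidate))
```

Non-ASCII usernames that pass this check would still need percent-encoding when placed in a URL.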

CleverPatrick