
I've run into a bit of a problem with a regex I'm using for human names.

$rexName = "/^[a-z' -]+$/i";

Suppose a user with the name Jürgen wishes to register? Or Böb? That's pretty commonplace in Europe. Is there a special notation for this?

EDIT: I just threw the name Jürgen at a regex generator, and it splits the word at the letter ü...

http://www.txt2re.com/index.php3?s=J%FCrgen+Blalock&submit=Show+Matches

EDIT2: All right, since checking for such specific things is hard, why not use a regex that simply checks for illegal characters?

$rexSafety = "/^[^<,\"@/{}()*$%?=>:|;#]*$/i";

(Now, which of these can actually be used in a hacking attempt?)

For instance, this allows ' and - signs, yet you need a ; to make it work in SQL, and those will be stopped. Are there any other characters commonly used for HTML injection or SQL attacks that I'm missing?

KdgDev
  • Just don’t validate that datum. – Gumbo Aug 11 '09 at 16:04
  • I've been wondering about this too... – Meep3D Aug 11 '09 at 16:04
  • I agree with @Gumbo, there's probably not a good reason to validate the characters in a name. A more appropriate solution might be to run the field against a blacklist regular expression, rather than trying to accept a whitelist of valid characters. What happens when 陳 tries to submit your form? Are you going to have a regular expression with every single international character in it? :) – Rob Hruska Aug 11 '09 at 16:09

4 Answers


I would really say: don't try to validate names. One day or another, your code will meet a name that it thinks is "wrong"... And how do you think someone would react when an application tells them "your name is not valid"?

Depending on what you really want to achieve, you might consider using some kind of blacklist / filters to exclude the "not-names" you thought about. It may let some "bad names" pass but, at least, it shouldn't prevent anyone with an existing name from accessing your application.

Here are a few examples of rules that come to mind:

  • no digits
  • no special characters, like "~{()}@^$%?;:/*§£ø and probably some others
  • no more than 3 spaces?
  • none of "admin", "support", "moderator", "test", and a few other obvious non-names that people tend to use when they don't want to type in their real name...
    • (but if they don't want to give you their name, they still won't, even if you forbid them from typing random letters: they could just use a real name... which is not theirs)
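
As a rough illustration, the rules above could be sketched in PHP along these lines. The function name nameWarnings, the exact character list and the reserved-word list are assumptions made for this example, not a vetted filter. The ø from the list above is left out here because it is an ordinary letter in names such as Søren.

```php
<?php
// Hypothetical helper: returns the reasons why $name looks like a "not-name".
// An empty array means none of the rules fired.
function nameWarnings(string $name): array
{
    $warnings = [];

    // Rule 1: no digits (\p{Nd} = any Unicode decimal digit).
    if (preg_match('/\p{Nd}/u', $name) === 1) {
        $warnings[] = 'contains a digit';
    }

    // Rule 2: no special characters (illustrative list, to be refined).
    if (preg_match('/["~{}()@^$%?;:\/*§£]/u', $name) === 1) {
        $warnings[] = 'contains a special character';
    }

    // Rule 3: no more than 3 spaces.
    if (substr_count($name, ' ') > 3) {
        $warnings[] = 'more than 3 spaces';
    }

    // Rule 4: obvious non-names (compared case-insensitively).
    $reserved = ['admin', 'support', 'moderator', 'test'];
    if (in_array(strtolower(trim($name)), $reserved, true)) {
        $warnings[] = 'reserved word';
    }

    return $warnings;
}

var_dump(nameWarnings('Jürgen'));   // no rule fires
var_dump(nameWarnings('ma{rtin'));  // special character rule fires
```

Returning a list of reasons instead of a plain boolean makes it easier to decide later which rules should block a registration and which should only be logged.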

Yes, this is not perfect; and yes, it will let some non-names pass... But it's probably way better for your application than telling someone "your name is wrong" (yes, I insist ^^ )


And, to answer a comment you left under another answer:

I could just forbid the most common characters for SQL injection and XSS attacks,

About SQL injection: you must escape your data before sending it to the database. If you always escape that data (you should!), you don't have to care about what users may or may not input: since it is always escaped, there is no risk for you.

The same goes for XSS: as long as you always escape your data when outputting it (you should!), there is no risk of injection ;-)
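
A minimal sketch of both kinds of escaping, assuming PDO with the SQLite driver is available (the in-memory database is used only so the example is self-contained) and a hypothetical users table. It uses a prepared statement instead of manual escaping, which is a swapped-in but commonly recommended way to get the same guarantee:

```php
<?php
// Assumption: the pdo_sqlite extension is enabled. Table and column names are hypothetical.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE users (name TEXT)');

// A "name" that would be dangerous if it were spliced into the SQL text.
$name = "O'Brien'; DROP TABLE users; --";

// The value is bound as a parameter, so it is only ever treated as data.
$stmt = $pdo->prepare('INSERT INTO users (name) VALUES (:name)');
$stmt->execute([':name' => $name]);

// The string comes back exactly as it was entered.
$stored = $pdo->query('SELECT name FROM users')->fetchColumn();

// Escape at output time, for the context it is printed into (HTML here).
$safeHtml = htmlspecialchars('<b>' . $stored . '</b>', ENT_QUOTES, 'UTF-8');
echo $safeHtml, "\n";
```

The stored value is unchanged; it is made harmless at the two boundaries (the query and the HTML page) rather than being rejected at input.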


EDIT: if you use that regex as is, it will not work.

The following code:

$rexSafety = "/^[^<,\"@/{}()*$%?=>:|;#]*$/i";
if (preg_match($rexSafety, 'martin')) {
    var_dump('bad name');
} else {
    var_dump('ok');
}

Will get you at least a warning:

Warning: preg_match() [function.preg-match]: Unknown modifier '{'

You must escape at least some of those special chars: the unescaped / inside the character class is taken as the closing delimiter, so everything after it is read as modifiers. I'll let you dig into PCRE Patterns for more information (there is really a lot to know about PCRE / regex, and I won't be able to explain it all).

If you actually want to check that none of those characters appears in a given piece of data, you might end up with something like this:

$rexSafety = "/[\^<,\"@\/\{\}\(\)\*\$%\?=>:\|;#]+/i";
if (preg_match($rexSafety, 'martin')) {
    var_dump('bad name');
} else {
    var_dump('ok');
}

(This is a quick and dirty proposal, which has to be refined!)

This one says "ok" (well, I definitely hope my own name is OK!)
And the same example with some special chars, like this:

$rexSafety = "/[\^<,\"@\/\{\}\(\)\*\$%\?=>:\|;#]+/i";
if (preg_match($rexSafety, 'ma{rtin')) {
    var_dump('bad name');
} else {
    var_dump('ok');
}

Will say "bad name".

But please note I have not fully tested this, and it probably needs more work! Do not use this on your site unless you have tested it very carefully!


Also note that a single quote can be helpful when attempting an SQL injection... but it is probably a character that is legal in some names. So just excluding some characters might not be enough ;-)

Pascal MARTIN
  • Yes, it will be escaped... but still entered into the database. I wouldn't like it if there were a couple hundred profiles on my website displaying nothing but a bunch of SQL code... – KdgDev Aug 11 '09 at 16:31
  • In this case, it might be interesting to add words like "select", "update", "delete", "where", "order by" and so on to the blacklist of forbidden words; after all, it is almost certain that they are not used in names ;-) You might also want to ensure that a user cannot register too many times (a quite basic idea, not necessarily the best one, would be to limit the number of registrations that can come from a single IP address in one hour, for instance) – Pascal MARTIN Aug 11 '09 at 16:36
  • Updated the original post with rexSafety variable. – KdgDev Aug 11 '09 at 16:48
  • Perhaps a better question to ask myself is: which characters do hackers ALWAYS need? For instance, I can allow the single quote and minus sign, but I will forbid = @ and ; The idea being a string meant to get past the security, will never be a single character. So it's a process of elimination: what is commonplace in human names and what is not. I don't need to forbid the ' character, since it will always be in the company of a @ or = sign. That's not 100% true, but I hope you see what I'm getting at. – KdgDev Aug 12 '09 at 03:52
  • The only problem with not allowing symbols: http://en.wikipedia.org/wiki/Prince_%28musician%29 – Thomas Owens Aug 14 '09 at 12:10

PHP’s PCRE implementation supports Unicode character properties, which cover a much larger set of characters. So you could use a combination of \p{L} (letter characters), \p{P} (punctuation characters) and \p{Zs} (space separator characters), together with the u modifier so that the pattern and the subject are treated as UTF-8:

/^[\p{L}\p{P}\p{Zs}]+$/u

But there might be characters that are not covered by these character categories while there might be some included that you don’t want to be allowed.
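
To make that trade-off concrete, here is a small sketch that runs such a pattern against a few inputs. The helper name looksLikeName and the sample values are assumptions for illustration. Note the u modifier, which is needed for the pattern to work on UTF-8 input:

```php
<?php
// Hypothetical helper: true when $name consists only of letters,
// punctuation and space separators (Unicode-aware thanks to the u modifier).
function looksLikeName(string $name): bool
{
    return preg_match('/^[\p{L}\p{P}\p{Zs}]+$/u', $name) === 1;
}

$candidates = [
    'Jürgen',
    'Böb',
    '陳',
    "O'Brien",
    'R2D2',                              // digits are not in \p{L}
    '<b>Bob</b>',                        // < and > are symbols, not punctuation
    "Robert'); DROP TABLE Students;--",  // only letters, punctuation and spaces
];

foreach ($candidates as $candidate) {
    echo $candidate, ' => ', looksLikeName($candidate) ? 'ok' : 'rejected', "\n";
}
```

The last candidate is accepted because quotes, parentheses, semicolons and dashes all count as punctuation, which is exactly the kind of unwanted inclusion mentioned above.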

So I advise against using regular expressions on a datum with such a vague range of values as a real person’s name.


Edit: As you edited your question, I now see that you just want to prevent certain code injection attacks. It is better to escape those characters than to reject them as a potential attack attempt.

Use mysql_real_escape_string (part of the old mysql extension, which was removed in PHP 7; mysqli_real_escape_string is its counterpart in mysqli) or prepared statements for SQL queries, htmlspecialchars for HTML output and other appropriate functions for other languages.

Gumbo
  • I don't only want to prevent code injection, I don't even want escaped code to ever enter the database. As I already stated: imagine a user-profile website(myspace for instance). Imagine coming across a profile ridden with SQL injections. All of them escaped... What the hell kind of service is that? Why would I allow hackers to fill my database with useless dribble like that, when the only thing they're trying to do is hack my website? – KdgDev Aug 11 '09 at 16:44

That's a problem with no easy general solution. The thing is that you really can't predict what characters a name could possibly contain. Probably the best solution is to define a negated character class that excludes the special characters you really don't want to end up in a name.

You can do this using:

$regexp = "/^[^<put unwanted characters here>]+$/";
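
One possible way to fill in that placeholder without hand-escaping every character is to build the class with preg_quote(). This is only a sketch: the function name isNameAllowed and the list of unwanted characters are assumptions for illustration, and the u modifier is added so that UTF-8 names are handled correctly.

```php
<?php
// Hypothetical helper: true when $name is non-empty and contains
// none of the characters listed in $unwanted.
function isNameAllowed(string $name, string $unwanted): bool
{
    // preg_quote() escapes regex metacharacters as well as the "/" delimiter.
    $class = preg_quote($unwanted, '/');

    return preg_match('/^[^' . $class . ']+$/u', $name) === 1;
}

// Illustrative list only; single quotes keep PHP from interpreting "$".
$unwanted = '<>,"@/{}()*$%?=:|;#';

var_dump(isNameAllowed('Jürgen', $unwanted));   // allowed
var_dump(isNameAllowed('ma{rtin', $unwanted));  // rejected because of "{"
```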

sebasgo
  • So if I can't predict the characters, wouldn't it be better to use a regex that disallows things instead of one that allows things? I could just forbid the most common characters for SQL injection and XSS attacks, which would allow things like ü. – KdgDev Aug 11 '09 at 16:12
  • No, don't filter for SQL keywords and similar things. That's extremely bad coding style. Instead, escape data properly. Use mysql_real_escape_string() to prevent SQL injections and htmlentities() for XSS attacks. – sebasgo Aug 11 '09 at 16:16
  • Yes, sebasgo is right on. This is a waste of your time if you're trying to prevent SQL injections. Use functions designed for this purpose, don't reinvent the wheel :P – hobodave Aug 11 '09 at 16:20
  • "That's extremely bad coding style". That makes no sense. Adding a couple of characters to a blacklisting regex can't be called "extremely bad coding style". I can understand if you favor using the functions you mentioned, but don't go saying that people who do things differently have an "extremely bad coding style" – KdgDev Aug 11 '09 at 16:20
  • If you filter SQL keywords, the poor Bobby Tables will not be able to attend school. – Stefano Borini Aug 11 '09 at 16:24
  • Also, the mysql_real_escape_string() function simply turns potentially harmful code into a dud. It makes it harmless... but it's still entered into the database and I don't want that. Imagine that on a profile site where a user profile displays a whole bunch of SQL code... – KdgDev Aug 11 '09 at 16:25
  • I read XKCD, Stefano. Also, I'm not filtering keywords at all, I'm filtering symbols. A hacker can put in as many keywords as he/she wants; if the ; symbol is found in the string, it won't matter, it'll be refused. – KdgDev Aug 11 '09 at 16:28

If you're trying to parse apart a human name in PHP, I recommend Keith Beckman's nameparse.php script.

Jonathon Hill