I am validating a form on my website that collects details about a person or business registering an online account with us.

I am writing this question to get some advice on how to validate the following types of information correctly.

To explain this, I will list a series of data types along with the HTML validation I had in mind. These patterns could then be reused in a series of PHP validations, among other things, to ensure that the form is always validated correctly. However, the standard HTML validation in my opinion looks better than anything I have been able to achieve by applying my own CSS.

First Names - ^[a-zA-Z -]{1,120} (a-z, from 1 to 120 characters long, big or small letters)

Last Names - ^[a-zA-Z -]{1,120}

Email Addresses - ^([a-zA-Z0-9_\\-\\.]+)@((\\[[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.)|(([a-zA-Z0-9\\-]+\\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\\]?)$ (validation including .com and .co.za domains which is what is mainly used)

If anyone has suggestions for improving these validation patterns, or knows of others that are in more standard use, that info would be greatly appreciated.

Also any information relating to why they should be or should not be used would be great too.

Thanks!!

Craig van Tonder
  • Not validating the address at all can permit mail injection. Such as writing your email address as \r\ncc: spam@spamemrs.com – Mihai Stancu Jun 17 '12 at 12:11
  • @Ben Wow, I never actually thought about it that way... Kind of a double job seeing as though I am requiring them to click on a link in a validation email. Would you care to explain more about the unicode characters? – Craig van Tonder Jun 17 '12 at 12:11
  • *How* you should validate your names depends on your requirements. Simply a-zA-Z for names would be way inappropriate for the typical names I am usually dealing with. Whether that's enough for you or not is up to you and your business. – deceze Jun 17 '12 at 12:12
  • @Mihai, Thanks :) What kind of validation would you then suggest? – Craig van Tonder Jun 17 '12 at 12:12
  • @deceze Well, it's more than likely not good enough, hence my question. Would you care to explain why this would be inappropriate? – Craig van Tonder Jun 17 '12 at 12:13
  • I was referring to @Ben who said that sending the "confirmation email" is validation enough. In my opinion you should validate the email (like you did above) to prevent email injection and to make sure the email is valid. And then send a "confirmation email". – Mihai Stancu Jun 17 '12 at 12:14
  • @Mihai Yes, I caught that one, thanks for your input in that regard. This is also what I actually have in place at the moment, so it's good to know that my conclusion was correct. – Craig van Tonder Jun 17 '12 at 12:16
  • Inappropriate because there are many people on this earth whose name contains letters outside the A-Z range. Starting with simple names like André and going to 太郎. Whether you can and/or want to consider those depends on you. – deceze Jun 17 '12 at 12:18
  • @deceze How would I include names like André? I'm not too worried about the Chinese characters because chances are that if they are Chinese I won't know how to deal with them anyway :) – Craig van Tonder Jun 17 '12 at 12:19
  • Well first define what you want to allow and what you don't. – deceze Jun 17 '12 at 12:20
  • @deceze I would like to include Latin characters like é, but not the Chinese ones as used by you. I really don't think there is much point in that based on my region. Also, is there anything else you could suggest in this regard? – Craig van Tonder Jun 17 '12 at 12:22
  • For example: http://stackoverflow.com/a/6548859/476 – deceze Jun 17 '12 at 12:31
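
The mail-injection concern raised in the comments above can be handled separately from format validation. A minimal sketch, written in Python for brevity (the function name `has_header_injection` is hypothetical, and the same check is easy to port to PHP): reject any address that contains CR, LF, or another control character before it is used in a mail header.

```python
def has_header_injection(address: str) -> bool:
    """Return True if the address contains CR, LF or any other control character."""
    # Control characters (code points below 32, plus DEL) have no place in an
    # address typed into a form. CR and LF are the ones that let an attacker
    # start a new header line such as "cc:".
    return any(ord(ch) < 32 or ord(ch) == 127 for ch in address)
```

This only guards against injected header lines. It says nothing about whether the address is well formed or actually exists.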

1 Answer

Your "validation" of names excludes all languages that don't use the Latin alphabet. Why? I guess you could check that there aren't any numbers in there and leave it at that. If you want people with non-Latin names to be able to use your site, then your database (if you use one) should be in a character set such as UTF-8 and you'll have to allow everything. Even trying to remove rude words can result in the Scunthorpe problem.
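
As a rough illustration of the "no numbers, allow everything else" idea, here is a minimal sketch in Python (the helper name `is_plausible_name` is hypothetical; the 120-character limit comes from the question, and rejecting control characters is an added assumption). It accepts any script and only rejects empty input, over-long input, digits, and control characters.

```python
import unicodedata

def is_plausible_name(name: str, max_length: int = 120) -> bool:
    """Permissive name check: any script is fine, digits and control characters are not."""
    name = name.strip()
    if not 1 <= len(name) <= max_length:
        return False
    for ch in name:
        # str.isdigit() is Unicode-aware, so digits from other scripts are caught too.
        # Category "Cc" covers control characters such as CR, LF and TAB.
        if ch.isdigit() or unicodedata.category(ch) == "Cc":
            return False
    return True
```

With this policy `André`, `太郎` and `O'Brien` all pass, while `R2D2` does not. Whether that is the right policy depends on your own requirements.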

Don't validate e-mails using regular expressions. Mail / ping the address and get the person to click on a link. It is technically impossible to validate an e-mail address using regexes, and the better ones that have been developed can be ridiculous. Non-Latin domain names exist and, as with names, you can't restrict them to the Latin alphabet and still be sure they contain what you want.

Also, ICANN are currently selling off some new gTLDs, which will substantially increase the available namespace, so you're never going to be able to guarantee that something actually exists without checking.
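
The "click on a link" step can be sketched as follows. This is a minimal Python illustration under stated assumptions: the function names are hypothetical, and storing only a SHA-256 digest of the token is one possible design choice, not something the thread prescribes. The token goes into the link that is e-mailed out, and the digest is what gets stored next to the pending account.

```python
import hashlib
import hmac
import secrets

def new_confirmation_token():
    """Return (token_for_the_emailed_link, digest_to_store_with_the_account)."""
    token = secrets.token_urlsafe(32)  # unguessable, safe to embed in a URL
    digest = hashlib.sha256(token.encode("utf-8")).hexdigest()
    return token, digest

def token_matches(submitted_token: str, stored_digest: str) -> bool:
    """Check the token from the clicked link against the stored digest."""
    submitted_digest = hashlib.sha256(submitted_token.encode("utf-8")).hexdigest()
    # compare_digest runs in constant time, which avoids leaking how many
    # leading characters of the digest matched.
    return hmac.compare_digest(submitted_digest, stored_digest)
```

If the link is clicked and the token matches, the address has been proven to exist and to be readable by the person registering, which no pattern can establish.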

Obviously, if you're using a database, use prepared statements to stop SQL Injection.
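
For illustration, this is what a parameterised insert looks like, sketched in Python with the standard `sqlite3` module (the `accounts` table, its columns, and the `save_registration` helper are all hypothetical; PHP's PDO offers the same mechanism). The values are passed separately from the SQL text, so quotes inside a name cannot change the statement.

```python
import sqlite3

def save_registration(conn, first_name, last_name, email):
    """Insert one registration using placeholders instead of string concatenation."""
    conn.execute(
        "INSERT INTO accounts (first_name, last_name, email) VALUES (?, ?, ?)",
        (first_name, last_name, email),  # bound as data, never parsed as SQL
    )
    conn.commit()

# In-memory database, used here only to make the sketch self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (first_name TEXT, last_name TEXT, email TEXT)")
save_registration(conn, "André", "O'Brien", "andre@example.com")
```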

Ben
  • As stated in my comments, you should check the email address does not contain malicious characters such as those used as control characters in the email format. – Mihai Stancu Jun 17 '12 at 12:15
  • It is practically impossible to validate an email according to RFC *using a regular expression*. It's perfectly possible to validate it using some more code... Though being perfectly RFC compliant is pretty tough... – deceze Jun 17 '12 at 12:16
  • Aside from that, you can in fact check if an email **really** exists by issuing an SMTP request and asking whether the mailbox is valid. – Mihai Stancu Jun 17 '12 at 12:16
  • @deceze, yes by mailing it, or pinging it as Mihai says. – Ben Jun 17 '12 at 12:17
  • I actually filter the info through a function that checks the domain to see if it exists too, so I think the combination of things should get most of the junk out? – Craig van Tonder Jun 17 '12 at 12:18
  • @BlackberryFan, now that ICANN are allocating new TLDs and allowing non-latin ones as well that's going to get more and more ridiculous, you're back to pinging again. – Ben Jun 17 '12 at 12:19
  • I agree that you **should** either ping the box or even better email the user and expect the confirmation link. But before doing any of those you should make sure the address is *at least* composed only of [a-zA-Z0-9-_.@] to make sure no email injection is performed. – Mihai Stancu Jun 17 '12 at 12:20
  • @Mihai Are you suggesting that the simple validation as stated is sufficient when used in combination with other methods of validation? – Craig van Tonder Jun 17 '12 at 12:22
  • @Mihai Email addresses are allowed to be much more complex than that. They're rare in practice, but the RFC defines more characters than [a-zA-Z0-9-_.@]. – deceze Jun 17 '12 at 12:22
  • @MihaiStancu, and how do you deal with non-latin domains? – Ben Jun 17 '12 at 12:22
  • I'm not all for supporting unlikely scenarios unless I have a realizable growth target (local/European business going international). But to answer your questions: if it is indeed mandatory to allow the exceptions as well, then I would check whether `\r\n` characters (or other email specification control characters) are present in the email string, and only after I am satisfied that the email address supplied is not an injection string would I ping/email+confirm. – Mihai Stancu Jun 17 '12 at 12:26
  • The problem with pinging is the lag: even if it's small, it is added to the server-processing time. I wouldn't go overboard and ping the domain, then ping the email box, then send the email. If I feel my users can handle the hassle of a confirmation link I just do that, otherwise I just ping the email box and validate instantly. – Mihai Stancu Jun 17 '12 at 12:29
  • @MihaiStancu, yes just ping the e-mail box. There's no point pinging the domain first. – Ben Jun 17 '12 at 12:30
  • There are caveats with pinging the box, not all email hosts are well configured and may not yield a positive reply when the user supplies an alias email (that works) but the email host doesn't recognize the box. Such as my email mihai.stancu@... vs my email alias ms@... – Mihai Stancu Jun 17 '12 at 12:30
  • @Ben btw I don't know who downvoted your answer but it has a ton of really good info so in my opinion it was too hasty of a vote... – Craig van Tonder Jun 17 '12 at 12:32
  • @BlackberryFan, it doesn't matter. All it means is that someone disagreed with what I have written. There's nothing wrong with that! – Ben Jun 17 '12 at 12:43
  • Oh, another thing to consider: if you're on the registration or the login page, it wouldn't be a bad thing to do as many time-consuming checks as you wish, as a way of slowing down brute-force attacks. You can even set a sleep period that grows larger with each unsuccessful attempt, kind of like the 1 hour temp-ban due to 5 failed log-ins, just distributed over time :D – Mihai Stancu Jun 17 '12 at 12:46
  • @MihaiStancu, that's quite a good idea. Never considered doing it like that before; though, it would annoy legitimate users. – Ben Jun 17 '12 at 12:54
  • You implement it only after the second or third attempt, so it wouldn't annoy legitimate users who didn't forget their passwords, and if they lost their password, well... tough, it's still better than receiving a message like "you've been banned for an hour for 5 failed attempts". Good side is that it would keep the abusing server's resources active and occupied for the entire delay. Bad side is it keeps your server's resources active and occupied for the entire delay. – Mihai Stancu Jun 17 '12 at 12:58
  • @Mihai Thank you so much for your good ideas and good insight too! – Craig van Tonder Jun 17 '12 at 20:26
  • @Ben after much digging through the links that you posted I found this: http://www.linuxjournal.com/article/9585?page=0,3 It seems to be a great function to deal with this task and I have decided to make use of it for now. – Craig van Tonder Jun 17 '12 at 21:56
  • @BlackberryFan, it's a good article and looks to be a good function for Latin addresses. Re-read the first sentence of the article though :-). – Ben Jun 17 '12 at 23:20
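
The growing-delay idea from the comments above can be sketched as a small function. This is a minimal Python illustration; the function name, the two free attempts, the doubling schedule, and the 60-second cap are all assumptions chosen for the example, not values anyone in the thread specified.

```python
def login_delay_seconds(failed_attempts: int,
                        free_attempts: int = 2,
                        base_delay: float = 1.0,
                        max_delay: float = 60.0) -> float:
    """Delay to impose before the next attempt, given the failures so far."""
    if failed_attempts <= free_attempts:
        return 0.0  # legitimate users who mistype once or twice are not slowed down
    # Double the delay for every failure beyond the free ones, up to a cap.
    delay = base_delay * 2 ** (failed_attempts - free_attempts - 1)
    return min(delay, max_delay)
```

With these defaults the third failure costs 1 second, the fourth 2 seconds, the fifth 4 seconds, and so on up to the cap. How the delay is enforced is a separate choice: sleeping in the request ties up your own server for the duration, as noted in the comments, so one alternative is to record the earliest time the next attempt is allowed and reject anything sooner.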