42

Although this seems like a trivial question, I am quite sure it is not :)

I need to validate the names and surnames of people from all over the world. Imagine a huge list of millions of names and surnames from which I need to remove, as well as possible, any cruft I identify. How can I do that with a regular expression? If the names were only English ones, I think this would cut it:

^[a-z '-]+$

However, I need to support also these cases:

  • other punctuation symbols as they might be used in different countries (no idea which, but maybe you do!)
  • different Unicode letter sets (accented letters, Greek, Japanese, Chinese, and so on)
  • no numbers or symbols or unnecessary punctuation or runes, etc.
  • titles, middle initials, suffixes are not part of this data
  • names are already separated by surnames.
  • we are prepared to force ultra rare names to be simplified (there's a person named '@' in existence, but it doesn't make sense to allow that character everywhere. Use pragmatism and good sense.)
  • note that many countries have laws about names so there are standards to follow

Is there a standard way of validating these fields I can implement to make sure that our website users have a great experience and can actually use their name when registering in the list?

I would be looking for something similar to the many "email address" regexes that you can find on Google.

Sklivvz
  • 30,601
  • 24
  • 116
  • 172
  • 9
    http://www.kalzumeus.com/2010/06/17/falsehoods-programmers-believe-about-names/ Might want to take a good read here. – epochwolf Nov 24 '11 at 06:32
  • 2
    I doubt that this is feasible - there are just too many Unicode symbols to exclude all unwanted symbols (and who will tell you which Chinese symbols to exclude?) and there are surely too many valid symbols to include them all (and you will have the Chinese symbols problem again). I would not put any constraints on a user name - it may even contain numbers; think of aristocratic names. – Daniel Brückner May 20 '09 at 16:15
  • Maybe this regex: `[[:alpha:]-]/u` – Black May 22 '19 at 11:11

13 Answers

47

I sympathize with the need to constrain input in this situation, but I don't believe it is possible: Unicode is vast and expanding, and so is the subset used in names throughout the world.

Unlike email, there's no universally agreed-upon standard for the names people may use, or even which representations they may register as official with their respective governments. I suspect that any regex will eventually fail to pass a name considered valid by someone, somewhere in the world.

Of course, you do need to sanitize or escape input, to avoid the Little Bobby Tables problem. And there may be other constraints on which input you allow as well, such as the underlying systems used to store, render or manipulate names. As such, I recommend that you determine first the restrictions necessitated by the system your validation belongs to, and create a validation expression based on those alone. This may still cause inconvenience in some scenarios, but they should be rare.
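As a rough illustration of validating against system constraints rather than against an idea of what a name looks like, here is a minimal JavaScript sketch. The function name `checkNameForStorage` and the 100-code-point limit are hypothetical; substitute whatever your own storage and rendering layers actually require.

```javascript
// Hypothetical limit: assume the name column holds at most 100 code points.
const MAX_NAME_CODE_POINTS = 100;

// Returns a list of problems; an empty list means the name is acceptable
// to this (assumed) system. It says nothing about whether the name is "real".
function checkNameForStorage(rawName) {
  const problems = [];
  // Normalize to NFC so equivalent sequences are stored consistently.
  const name = rawName.normalize("NFC").trim();

  if (name.length === 0) {
    problems.push("empty");
  }
  // Spreading a string iterates by code point, so astral characters count once.
  if ([...name].length > MAX_NAME_CODE_POINTS) {
    problems.push("too long");
  }
  // \p{Cc} matches control characters (tab, newline, NUL, ...).
  if (/\p{Cc}/u.test(name)) {
    problems.push("control character");
  }
  return problems;
}
```

Escaping for SQL or HTML is deliberately not part of this check; that belongs at the point where the value is written to the database or rendered.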

Shog9
  • 156,901
  • 35
  • 231
  • 235
Chris Cudmore
  • 29,793
  • 12
  • 57
  • 94
  • 15
    Actually I'd allow Bobby to enter his name; I'd just make sure it was escaped before I sent it to the database. Similarly I'd allow Mr "> to have his given name, and I'd escape it before sending it to the browser. I'd only sanitise the input if I thought my colleagues might screw up the escaping. – user9876 May 20 '09 at 16:16
  • 7
    @Skliwz: Then that's what you need to address. If they don't properly escape when inserting into SQL, any name with an apostrophe (which your original question already recognizes as necessary) opens you up to security vulnerabilities. Imagine trying to authenticate a user named "Foo'or True Or'foo": no "dangerous" characters, but there goes your login scheme. – Ben Blank May 20 '09 at 16:28
  • 1
    If all you're doing is reading and writing to the db, then properly parameterizing queries should take care of the problem. However, if you're ever going to dynamically execute code from the db, then you have to be careful (such as when using the exec() statement). – Chris Cudmore May 20 '09 at 17:54
  • 1
    How is it a bogus answer? Any answer to this specific question would be speculation at best. – Qix - MONICA WAS MISTREATED Aug 02 '13 at 03:41
  • 3
    I think that the assumption that every website must accommodate every possible name is fallacious. People with weird names are used to not being able to use them everywhere. My last name is too long to fit on many credit cards and government forms, so I just truncate it. Often the hyphen is dropped. No biggie. The app I'm working on now has a thousand users a month who enter e-mail addresses in the first- or last-name field when signing up. Some people might legitimately have an "@" in their names, but this number is minuscule compared with the number who simply are making an error. – Patrick Brinich-Langlois Dec 02 '14 at 18:18
  • 4
    @PatrickBrinich-Langlois: Which is all well and good until you can’t board a plane or make a bank transfer because of it (both of which have happened to me on account of apostrophe mishandling). – Ry- Sep 26 '19 at 17:56
19

I'll try to give a proper answer myself:

The only punctuations that should be allowed in a name are full stop, apostrophe and hyphen. I haven't seen any other case in the list of corner cases.

Regarding numbers, there's only one case with an 8. I think I can safely disallow that.

Regarding letters, any letter is valid.

I also want to include space.

This would sum up to this regex:

^[\p{L} \.'\-]+$

This presents one problem, i.e. the apostrophe can be used as an attack vector. It should be encoded.

So the validation code should be something like this (untested):

var name = nameParam.Trim();
// Verbatim string (@"...") so the regex backslashes are not treated as C# escapes.
if (!Regex.IsMatch(name, @"^[\p{L} \.'\-]+$")) 
    throw new ArgumentException("nameParam");
name = name.Replace("'", "&#39;");  // &apos; does not work in IE

Can anyone think of a reason why a name should not pass this test or a XSS or SQL Injection that could pass?


complete tested solution

using System;
using System.Text.RegularExpressions;

namespace test
{
    class MainClass
    {
        public static void Main(string[] args)
        {
            var names = new string[]{"Hello World", 
                "John",
                "João",
                "タロウ",
                "やまだ",
                "山田",
                "先生",
                "мыхаыл",
                "Θεοκλεια",
                "आकाङ्क्षा",
                "علاء الدين",
                "אַבְרָהָם",
                "മലയാളം",
                "상",
                "D'Addario",
                "John-Doe",
                "P.A.M.",
                "' --",
                "<xss>",
                "\""
            };
            foreach (var nameParam in names)
            {
                Console.Write(nameParam+" ");
                var name = nameParam.Trim();
                if (!Regex.IsMatch(name, @"^[\p{L}\p{M}' \.\-]+$"))
                {
                    Console.WriteLine("fail");
                    continue;
                }
                name = name.Replace("'", "&#39;");
                Console.WriteLine(name);
            }
        }
    }
}
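For readers who need the same check in JavaScript, the pattern can be ported using Unicode property escapes, which require the `u` flag (ES2018 and later). This is only a sketch of an equivalent; the names `NAME_PATTERN`, `isAcceptedName` and `encodeApostrophe` are made up for illustration.

```javascript
// Letters (\p{L}), combining marks (\p{M}), apostrophe, space, full stop, hyphen.
// The "u" flag is required for \p{...} property escapes.
const NAME_PATTERN = /^[\p{L}\p{M}' .-]+$/u;

// True when the trimmed input consists only of the permitted characters.
function isAcceptedName(nameParam) {
  return NAME_PATTERN.test(nameParam.trim());
}

// HTML-encode the apostrophe, as the C# version does.
function encodeApostrophe(name) {
  return name.replace(/'/g, "&#39;");
}
```

Note that, exactly like the C# pattern, this accepts the string `' --`, because apostrophe, space and hyphen are all permitted characters; the pattern constrains the character set, not the shape of the name.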
Sklivvz
  • 30,601
  • 24
  • 116
  • 172
  • 24
    Sorry, you're still going to leave valid names out in the cold. I strongly suggest you read up on diacritics in Arabic, especially those that are separate Unicode characters but combine with letters to change them. Will you be disallowing things like "John W. Saunders, 3rd"? I hope not. It's just a much wider world out there than you seem to realize, and your simplistic, Western-oriented rules will simply not work in general. – John Saunders May 21 '09 at 01:30
  • 4
    Hi John, the regex does support diacritics (arabic is also in the test cases) with the \p{M}. Moreover, I am only validating names, i.e. in your example those would be "John W." (or "John" and "W.") and "Saunders". "," is not part of the name and "3rd" is a suffix. – Sklivvz May 21 '09 at 06:11
  • You expect the users to enter FirstName, LastName, Suffix???? Or will you also have Prefix, MiddleName1, MiddleName2 .... There's another Question about names that discussed these issues extensively. – Osama Al-Maadeed May 24 '09 at 00:15
  • 7
    People coming in from Saint-Louis-du-Ha!_Ha! would be upset. http://en.wikipedia.org/wiki/Saint-Louis-du-Ha!_Ha!,_Quebec – Chris Cudmore Jun 22 '12 at 17:04
  • The fact that there are weirdos with unlikely names does not make this unuseful in the least... You are taking this way, way too literally! YAGNI – Sklivvz Jun 22 '12 at 17:25
  • Is there a way to write this regex in the javascript regex engine? – Piotr Tomasik Oct 19 '12 at 08:28
  • 5
    [श्री खनाल is not happy with you.](http://meta.stackexchange.com/questions/171814/display-name-in-local-language) Seriously, read http://www.kalzumeus.com/2010/06/17/falsehoods-programmers-believe-about-names/ and trade your beliefs for a [saner](http://stackoverflow.com/a/888902) [approach](http://stackoverflow.com/a/888870). – Gilles 'SO- stop being evil' Mar 14 '13 at 12:27
  • I disagree, just because you can't think of valid uses it doesn't mean there aren't any. For example sanitizing data. – Sklivvz Mar 14 '13 at 13:21
  • Sklivvz pointed out that "3rd" would not work, but what is wrong with saying III instead of 3rd? – Sarah Weinberger Jun 18 '14 at 19:13
  • Can you sync your code snippets together? It's extremely confusing with 3 different snippets of code. – nhahtdh Jul 21 '15 at 09:19
  • "*The only punctuations that should be allowed in a name are full stop, apostrophe and hyphen. I haven't seen any other case in the list of corner cases.*" Mark Rein•Hagen - one of the creators of the World of Darkness games. Swery65 - Japanese game designer - not his official name but operates under it. [FM-2030](https://en.wikipedia.org/wiki/FM-2030) an author- it's his official legal name. I've seen names that use an [en- or em-dash](https://en.wikipedia.org/wiki/Dash) which are *not* the hyphen/minus sign. Some people also use a comma in their name "Fred Bloggs, Ph.D". – VLAZ Nov 20 '19 at 23:53
  • @Sklivvz FM-2030 was changed literally *legally* - as such, it's used for official and administrative purposes. I don't see why it's not a legal name at that point. Regex doesn't deal with *legality* anyway but patterns. The pattern you've defined excludes legal names from being entered and I don't think that as a developer of a system it's good to make users "guilty until proven innocent" (so to say) on matters of personal information. Since the pattern is based on only what you've seen, I've offered more information to draw from for the future. – VLAZ Nov 21 '19 at 13:37
  • @Sklivvz if legality was *really* an issue here, then this regex is even more unfit than differentiating between FM-2030 and Fred Bloggs. What if somebody puts down their name as "John" but their legal name is "Bob"? The former is also not a legal name, yet the proposed system doesn't do anything about the case, while it will (somewhat correctly) flag Swery65 as invalid. – VLAZ Nov 21 '19 at 13:39
  • @VLAZ OK but it's still one weirdo. If you think that changes anything in this, you probably are treating this as a mathematical problem. It is not. It's a way of making a large set of data better -- in that context the plea of FM-2030 is really unimportant. – Sklivvz Nov 21 '19 at 18:46
  • @Sklivvz I treat it as a *human* problem. This system will have direct repercussions of people who use it. At least I assume so - otherwise, it's sort of useless to enter names there if there is no result. I've suffered enough due to my name and it doesn't even have weird characters in it. I've also talked with and heard from enough people with names slightly outside the norm who've shared their troubles related to something they rarely control yet are somehow "at fault" for some developer's oversight. And if somebody does *legally* take an unorthodox name apparently they are weirdos to boot. – VLAZ Nov 21 '19 at 18:58
  • OK, let's say that we can accept a reasonable set of outliers (single people with really weird names). Can we move on from this point? – Sklivvz Nov 22 '19 at 07:06
  • The two escapings in the class are completely unnecessary - using those indicates a questionable understanding of regexes - it should be `^[\p{L} .'-]+$` right away. – AmigoJack Mar 05 '22 at 09:49
16

I would just allow everything (except an empty string) and assume the user knows what his name is.

There are 2 common cases:

  1. You care that the name is accurate and are validating against a real paper passport or other identity document, or against a credit card.
  2. You don't care that much and the user will be able to register as "Fred Smith" (or "Jane Doe") anyway.

In case (1), you can allow all characters because you're checking against a paper document.

In case (2), you may as well allow all characters because "123 456" is really no worse a pseudonym than "Abc Def".

user9876
  • 10,954
  • 6
  • 44
  • 66
  • 4
    +1 Using a regex will only guarantee that the input matches the regex, it will not tell you that it is a valid name – kscott May 20 '09 at 16:23
  • Are there names with emojis? – Cœur Aug 10 '17 at 16:14
  • 1
    @Cœur AFAIK not yet, but some "trendy" parent is bound to inflict that on their child eventually... – user9876 Sep 08 '17 at 17:02
  • The problem here is that in Chrome/gmail if the customer enters a name such xxx.yyy.com the browser forces it to be a clickable link and making it a possible attack vector. The question becomes how do you alleviate that attack surface? – Mindfulgeek Feb 08 '21 at 18:35
13

I would think you would be better off excluding the characters you don't want with a regex. Trying to get every umlaut, accented e, hyphen, etc. will be pretty insane. Just exclude digits (but then what about a guy named "George Forman the 4th") and symbols you know you don't want like @#$%^ or what have you. But even then, using a regex will only guarantee that the input matches the regex, it will not tell you that it is a valid name.

EDIT after clarifying that this is trying to prevent XSS: A regex on a name field is obviously not going to stop XSS on its own. However, this article has a section on filtering that is a starting point if you want to go that route:

s/[\<\>\"\'\%\;\(\)\&\+]//g;

"Secure Programming for Linux and Unix HOWTO" by David A. Wheeler, v3.010 Edition (2003)

v3.72, 2015-09-19 is a more recent version.
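To make the trade-off concrete, here is a small JavaScript sketch of the same character filter next to an escaping alternative. The function names are hypothetical, and the escape table covers only the five usual HTML metacharacters for element and attribute-value contexts; other output contexts need their own escaping.

```javascript
// Equivalent of the Perl substitution above: delete the listed characters.
function stripMetaCharacters(input) {
  return input.replace(/[<>"'%;()&+]/g, "");
}

// Alternative: keep every character and HTML-escape at output time instead.
const HTML_ESCAPES = {
  "&": "&amp;",
  "<": "&lt;",
  ">": "&gt;",
  '"': "&quot;",
  "'": "&#39;",
};

function escapeHtml(input) {
  return input.replace(/[&<>"']/g, (ch) => HTML_ESCAPES[ch]);
}
```

Stripping turns `D'Addario` into `DAddario`, which changes the person's name, whereas escaping renders it unchanged in the browser.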

AmigoJack
  • 5,234
  • 1
  • 15
  • 31
kscott
  • 1,866
  • 3
  • 25
  • 41
  • 1
    Well beyond sanitizing the input, I don't see a reason to eliminate any characters. What are you trying to prevent? – kscott May 20 '09 at 16:14
  • Any characters that you can be sure won't end up in a name. Since people really can be named anything, nothing is safe to some extent. But I think the examples given by kscott !@#$%^ are a good place to start. You could easily run a large name list through your expression when you're done and see what falls out (if any). +1 – Copas May 20 '09 at 16:14
  • 1
    No regex is going to prevent a cross site scripting attack – kscott May 20 '09 at 16:36
  • I think you're answering your own question, Skliwz, you're not going to find a regex that covers all unicode characters and prevents cross site scripting. If stopping XSS was a simple as finding a magic regex, a lot of us would be out of jobs. – kscott May 20 '09 at 16:49
  • 1
    Then you need to decide what you're trying to prevent. If it's XSS, you only need to stop malicious characters, and even then you need to be doing more. If it's people entering names you don't like, then you're SOL: you'll never get a regex that handles every name in every culture. Heck, you probably couldn't even get one that handled American Hippies. – kscott May 20 '09 at 17:09
  • Why would anyone escape all this in a regex class? It should be `/[<>"'%;()&+]/` right away if a context scope (Unix shell, C++ String literal, JS...) remains unspecified. – AmigoJack Mar 05 '22 at 10:35
7

BTW, do you plan to only permit the Latin alphabet, or do you also plan to try to validate Chinese, Arabic, Hindi, etc.?

As others have said, don't even try to do this. Step back and ask yourself what you are actually trying to accomplish. Then try to accomplish it without making any assumptions about what people's names are, or what they mean.

John Saunders
  • 160,644
  • 26
  • 247
  • 397
  • 2
    Try to do what? Do you know the rules for naming in those languages? Do you know how to distinguish between a first name and last name in those languages? Don't parse the names at all - just accept that people know their names. – John Saunders May 20 '09 at 16:56
  • 9
    Because validating a name is not how you prevent cross-site scripting. You allow the users to put whatever they want in the field, since names are crazy and there are a lot of Unicode characters in the world, then you treat whatever anyone puts in that field like it's radioactive. – kscott May 20 '09 at 16:57
6

I don’t think that’s a good idea. Even if you find an appropriate regular expression (maybe using Unicode character properties), this wouldn’t prevent users from entering pseudo-names like John Doe, Max Mustermann (there even is a person with that name), Abcde Fghijk or Ababa Bebebe.

Gumbo
  • 643,351
  • 109
  • 780
  • 844
  • If you want to print user input into an HTML document, escape the HTML meta characters (`&`, `<`, `>`, `"` and `'`). If you want to print user input into a JavaScript string declaration, escape the JavaScript string meta characters (`\`, `"` and `'`). If you want to print user input into a JavaScript string declaration inside an HTML document, first escape the JavaScript string meta characters, then the HTML meta characters. If you want to use user input in a SQL string declaration, escape the SQL string meta characters. Do you see the pattern? – Gumbo May 20 '09 at 18:14
  • 4
    It’s YOUR job to do that on the server side and not the client’s. Remember: Never trust user data! – Gumbo May 20 '09 at 18:40
  • 4
    Well, it seems that you didn’t understand what XSS exactly is or what its fundamental flaw is. It’s changing from one context, in which a certain value is considered safe, into another, in which the same value isn’t considered safe. And that change is initiated by the value itself, as it contains particular character sequences that mark the end of the one and the start of the other context. Just like the `"` marks the end/beginning of a string declaration. Now if you want to put a string into another string declaration, you need to escape those character sequences to get them treated as literals. – Gumbo May 21 '09 at 07:11
  • So it suffices if you just escape the language and context dependent meta characters (those with the special meaning in that language and context) to get them be treated as literals and not as meta characters. – Gumbo May 21 '09 at 07:13
5

You could use the following regex to validate two names separated by a space:

^[A-Za-zÀ-ú]+ [A-Za-zÀ-ú]+$

or just use:

[[:lower:]] = [a-zà-ú]

[[:upper:]] =[A-ZÀ-Ú]

[[:alpha:]] = [A-Za-zÀ-ú]

[[:alnum:]] = [A-Za-zÀ-ú0-9]
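One caveat worth checking before relying on the `À-ú` range: it covers the code points U+00C0 to U+00FA, which include the symbols `×` (U+00D7) and `÷` (U+00F7) and stop short of `ü` (U+00FC). The JavaScript sketch below compares it with the Unicode letter property; the variable names are illustrative only.

```javascript
// The range-based class from the answer above.
const latinRange = /^[A-Za-zÀ-ú]+$/;

// Unicode-aware alternative: any letter in any script ("u" flag required).
const anyLetter = /^\p{L}+$/u;

// Collect what each pattern says about a few sample strings.
const samples = ["João", "Müller", "Łukasz", "×", "÷"];
const results = samples.map((s) => ({
  sample: s,
  latinRange: latinRange.test(s),
  anyLetter: anyLetter.test(s),
}));
console.log(results);
```

The range accepts the two arithmetic symbols but rejects `Müller` and `Łukasz`, while `\p{L}` does the opposite in each case.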

  • 6
    This regex expression will miss something like "Laura E. Ingalls" or "Laura Elisabeth Ingalls Wilder" or "Laura Elisabeth Ingalls-Wilder". – Sarah Weinberger Jun 18 '14 at 19:07
2

This one worked perfectly for me in JavaScript: ^[a-zA-Z]+[\s-]?[a-zA-Z]+[\s-]?[a-zA-Z]+$

Here is the method:

function isValidName(name) {
    // Inside a character class "|" is a literal pipe, so [\s-] is used
    // here to allow only whitespace or a hyphen between the parts.
    return /^[a-zA-Z]+[\s-]?[a-zA-Z]+[\s-]?[a-zA-Z]+$/.test(name);
}
Ákos Kovács
  • 502
  • 1
  • 10
  • 23
user2288580
  • 2,210
  • 23
  • 16
2

A very contentious subject that I seem to have stumbled upon here. However, sometimes it's nice to head dear little Bobby Tables off at the pass and send little Robert to the headmaster's office along with his semicolons and SQL comment lines (--).

This regex in VB.NET includes regular alphabetic characters and various accented European characters. However, poor old James Mc'Tristan-Smythe the 3rd will have to input his pedigree as Jim the Third.

<asp:RegularExpressionValidator ID="RegExValid1" Runat="server"
                    ErrorMessage="ERROR: Please enter a valid surname<br/>" SetFocusOnError="true" Display="Dynamic"
                    ControlToValidate="txtSurname" ValidationGroup="MandatoryContent"
                    ValidationExpression="^[A-Za-z'\-\p{L}\p{Zs}\p{Lu}\p{Ll}\']+$">
Timi
  • 31
  • 2
  • 2
    This example does not like "Laura E. Wilder", basically the period. A simple fix would be to add the period, so "^[A-Za-z.'\-\p{L}\p{Zs}\p{Lu}\p{Ll}\']+$" – Sarah Weinberger Jun 18 '14 at 19:11
2

It's a very difficult problem to validate something like a name due to all the corner cases possible.

Corner Cases

Sanitize the inputs and let them enter whatever they want for a name, because deciding what is a valid name and what is not is probably way outside the scope of whatever you're doing, given that the range of potential strange (and legal) names is nearly infinite.

If they want to call themselves Tricyclopltz^2-Glockenschpiel, that's their problem, not yours.

Trampas Kirk
  • 1,436
  • 3
  • 16
  • 21
0

Steps:

  1. first remove all accents
  2. apply the regular expression

To strip the accents:

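// Requires: using System.Globalization; using System.Text;
private static string RemoveAccents(string s)
{
    // FormD decomposes each accented letter into a base letter plus combining marks.
    s = s.Normalize(NormalizationForm.FormD);
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < s.Length; i++)
    {
        // Keep everything except the nonspacing (combining) marks.
        if (CharUnicodeInfo.GetUnicodeCategory(s[i]) != UnicodeCategory.NonSpacingMark) sb.Append(s[i]);
    }
    return sb.ToString();
}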
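The same idea can be sketched in JavaScript with `String.prototype.normalize` and the `\p{Mn}` (nonspacing mark) property escape. This is an assumed equivalent for illustration, not a tested port of the method above.

```javascript
// Decompose to NFD, then drop nonspacing marks (\p{Mn}).
function removeAccents(s) {
  return s.normalize("NFD").replace(/\p{Mn}/gu, "");
}

console.log(removeAccents("João")); // "Joao"
```

The result stays in decomposed form, so a Hangul syllable such as 상 comes back as separate jamo unless `.normalize("NFC")` is applied afterwards. Vowel points in Hebrew and Arabic are nonspacing marks too, so they are stripped along with Latin accents.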
Martin Staufcik
  • 8,295
  • 4
  • 44
  • 63
  • Will this work with Arabic, Hebrew, Chinese, Tamil, Cree, or Cyrillic names? – TRiG Jan 04 '19 at 12:15
  • NormalizationForm.FormD indicates that a Unicode string is normalized using full canonical decomposition. I am not sure what that means. Maybe this is a topic for a separate question? The solution in this answer is tested for Latin characters. – Martin Staufcik Jan 04 '19 at 13:45
-2

This somewhat helps:

^[a-zA-Z]'?([a-zA-Z]|\.| |-)+$

MT.
  • 1,915
  • 3
  • 16
  • 19
  • 1
    This answer came years after the accepted answer and doesn't consider any of the less-common-in-America cases. – moopet Oct 09 '18 at 12:01
-3

This one should work ^([A-Z]{1}+[a-z\-\.\']*+[\s]?)* Add some special characters if you need them.

Todor Todorov
  • 2,503
  • 1
  • 16
  • 15
  • This answer came years after the accepted answer and doesn't consider any of the less-common-in-America cases. – moopet Oct 09 '18 at 12:01