62

There are the standard A-Z, a-z characters, but also there are hyphens, em dashes, quotes, etc.

Plus, there are all of the international characters, like umlauts, etc.

So, for an English-based system, what's the complete set? What about sets for other languages? What about UTF-8, UTF-16, etc.?

Bonus question: How many name fields are needed, and what are their maximum lengths?

EDIT: There are definitely two different types of characters involved in people's names: those that are there as part of the content, and those that are there for structural reasons. I don't want to limit or interfere with the content characters, but I do need to deal with the structural ones.

For example, I had a name come in that was separated by an em dash, but it was hard to distinguish that from the minus character. To make the system easier to search, I want to take all five different types of dashes and map them onto one unique character (minus), so that the searcher doesn't need to know specifically which symbol was initially entered.

The problem exists for dashes, probably quotes as well, but also how many other symbols?
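A minimal sketch of that kind of folding, in Python. The exact set of look-alike characters below is an assumption for illustration, not a complete list, and `FOLD_TABLE` and `search_key` are hypothetical names. The idea is to store the name exactly as entered and apply the folding only to a separate search key (and to the search term).

```python
# Structural look-alikes folded to plain ASCII. This set is an assumption;
# extend it after looking at what actually shows up in your data.
FOLD_TABLE = str.maketrans({
    "\u2010": "-",  # HYPHEN
    "\u2011": "-",  # NON-BREAKING HYPHEN
    "\u2012": "-",  # FIGURE DASH
    "\u2013": "-",  # EN DASH
    "\u2014": "-",  # EM DASH
    "\u2015": "-",  # HORIZONTAL BAR
    "\u2212": "-",  # MINUS SIGN
    "\u2018": "'",  # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",  # RIGHT SINGLE QUOTATION MARK
    "\u201c": '"',  # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',  # RIGHT DOUBLE QUOTATION MARK
})


def search_key(name: str) -> str:
    """Fold structural characters only; letters such as ü are left untouched."""
    return name.translate(FOLD_TABLE)
```

Applying `search_key` to both the stored name and the search term means the searcher never has to know which dash or quote was originally typed.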

Your Common Sense
Paul W Homer
  • Just saw in our system an em-dash in UTF8, that should have been mapped to an ASCII minus (otherwise you can't search on it). How big is my problem? – Paul W Homer Jan 07 '09 at 16:46
  • We're not in a position to restrict the input, so for hyphens, dashes and minuses for example, they'll have to be consistently mapped back to simple ASCII to make it all work properly. – Paul W Homer Jan 07 '09 at 16:47
  • You can't map all names to simple ASCII (you could map dashes & similar characters, though). What about Chinese, Vietnamese and Arabic names? – Joachim Sauer Jan 07 '09 at 16:57
  • Yes, I guess I should distinguish between "content characters" and "meta characters". I want to map identical meta-characters onto a single character, but not content ones. – Paul W Homer Jan 07 '09 at 17:05
  • This can easily lead to clashes: Paul Müller and his re-christened friend in the US Paul Muller might not be too happy. – innaM Jan 07 '09 at 17:36
  • We used to refer to this as the O'Brien fallacy – annakata Jan 07 '09 at 17:37
  • See [Falsehoods programmers believe about names](http://www.kalzumeus.com/2010/06/17/falsehoods-programmers-believe-about-names/). In short: all characters are valid as one's name is whatever they want it to be. – Andrew Marshall Oct 29 '12 at 03:52

10 Answers

59

There's a good article by the W3C called Personal names around the world that explains the problems (and possible solutions) pretty well (it was originally a two-part blog post by Richard Ishida: part 1 and part 2).

Personally I'd say: support every printable Unicode character and, to be safe, provide just a single field "name" that contains the full, formatted name. This way you can store pretty much every form of name. You might need a more structured storage, but then don't expect to be able to store every single combination in a structured form, as there are simply too many different ones.
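One way to approximate "printable" is by Unicode general category. The following is only a sketch: which categories to reject is an assumption made here, and `is_acceptable_name` is a hypothetical helper. Format characters (category Cf) are deliberately allowed, because some scripts rely on joiners.

```python
import unicodedata

# Categories rejected in this sketch (an assumption, adjust as needed):
# Cc = control, Cs = surrogate, Co = private use, Cn = unassigned.
REJECTED_CATEGORIES = {"Cc", "Cs", "Co", "Cn"}


def is_acceptable_name(name: str) -> bool:
    """Accept any non-blank string that contains no rejected code points."""
    if not name.strip():
        return False
    return all(unicodedata.category(ch) not in REJECTED_CATEGORIES for ch in name)
```

Ordinary spaces are category Zs, so multi-word names pass, while tabs, newlines and NUL bytes are rejected as control characters.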

Joachim Sauer
  • It makes sense, but I'm surprised, I always thought that full-name and salutation were older cruder conventions than first, middle and last. – Paul W Homer Jan 07 '09 at 17:07
  • @Paul: Perhaps so, when you're only dealing with a single culture and language. But when you get into different name orders, different numbers of names, etc., it can be hard for a human to follow all the possibilities, never mind a program, so just let the user deal with how to render his own name. – Dave Sherohman Jan 08 '09 at 02:06
  • Is there a definition for printable unicode characters? – lg.lindstrom Jan 19 '14 at 11:10
18

Whitelisting characters that could appear in a person's name is the wrong way to go, if you ask me. Sure, [A-Za-z] is a fair starting point, but, as you said, you get problems with "European" names. So you map all the umlauts, circumflexes and those. What about Chinese names? Japanese? Indian? Hebrew? You're entering a battle against wind turbines.

If you absolutely must check the validity of someone's name, I'd suggest doing a modest blacklist of certain characters. Braces, mathematical characters, some punctuation and such might be safe to ignore. But I'd be cautious, if I were you.

It might be best to just accept whatever comes in. UTF-16 should be today's overkill character set, that should be adequate for some years to come.

Edit: As for your question about name length and number of names: if you really want people to write their real and complete names, I guess the only foolproof answer to both of those questions would be "infinite". I can't whip out any real examples for human beings, but surely there are human names analogous to the native name of the city of Bangkok.

Henrik Paul
  • I would allow anything but whitespace – Ferruccio Jan 07 '09 at 16:54
  • But really, why? What if my surname was "von Rettig"? If I would index that surname, it would be written "Rettig, von". You can never be too sure... – Henrik Paul Jan 07 '09 at 16:57
  • I don't want to limit the input, but in certain cases I want to map several characters down onto a single one, for instance I want to map any long dashes in UTF8 onto minuses in ASCII, so searching is easier. – Paul W Homer Jan 07 '09 at 17:13
  • dashes to hyphens/minuses sound harmless enough to me. But were you thinking also of mapping Ä to A? In Nordic (in Northern Europe, that is) languages, those two characters are very different. In fact, A being the last character in the alphabet, Ä is the second to last in Finnish (and a few others). – Henrik Paul Jan 07 '09 at 17:18
  • I mean: "A being the _first_ character in the alphabet..." – Henrik Paul Jan 07 '09 at 17:19
  • Just to be a smartass: UTF-16 is not a character set but an encoding for characters of the Unicode set. You actually meant Unicode. (UTF-16 doesn't contain any more characters than UTF-8.) – Daniel Böhmer Feb 26 '19 at 23:07
13

I don't think there's a definitive answer. After all, some people have names that can't even be expressed in UTF-16...

(Image: the Prince symbol)

There are some odd people out there, who will give their kids the craziest of names, including putting in weird punctuation, accents that don't exist in their own language, etc.

However, you can place arbitrary restrictions on your database. If you want to you can insist on 7 bit ASCII names. It's slightly rude to users, but they'll live with it. It certainly makes searching easier.

My colleague's daughter is named Amélie. But even some (not all!) official British government web sites ("Please enter the name exactly as shown on the birth certificate") won't accept the accented character, so he has to use 'Amelie' instead.

Community
slim
  • AFAIK (and Wikipedia seems to agree with that) the symbol was never the legal name of Prince, but only the stage name ;-) – Joachim Sauer Jan 07 '09 at 16:59
  • The fact that the British government does it all wrong shouldn't keep anyone from doing it right. – innaM Jan 07 '09 at 17:04
  • I'm warped enough to name a child after Håkon Wium just to cause issues for said systems (that and in honour of his work with CSS) – Rowland Shaw Apr 01 '09 at 15:11
5

Any character that can be represented by any multiple of eight bits (greater than zero) is a possible character for a person's name. Lengths of both names and encodings are arbitrary, so no upper bound should be considered.

Just make sure you sanitize your database inputs so little Bobby Drop-tables doesn't get ya.
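In practice "sanitize" usually means binding the name as a query parameter rather than pasting it into the SQL text. A minimal sketch using Python's built-in `sqlite3` module; the `people` table and the `store_name` helper are hypothetical and exist only for this example.

```python
import sqlite3


def store_name(conn: sqlite3.Connection, full_name: str) -> None:
    """Insert a name using a bound parameter, never string concatenation."""
    conn.execute("INSERT INTO people (full_name) VALUES (?)", (full_name,))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (full_name TEXT NOT NULL)")

# The hostile-looking value is stored as plain data, not executed.
store_name(conn, "Robert'); DROP TABLE people;--")
rows = conn.execute("SELECT full_name FROM people").fetchall()
```

With the value bound as a parameter, quotes and semicolons in a name are just characters, so there is no need to forbid them on input.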

Max
  • I'd prohibit at least all unprintable Unicode characters (control characters), as they are very, very, very unlikely to be contained in a name. – Joachim Sauer Jan 07 '09 at 17:03
  • Yes, but he didn't restrict his inputs to Unicode, did he? I'm trying to demonstrate that it's futile to try and guess what will and will not be a part of a valid name. You either place arbitrary restrictions on a name, and offend a few people later on, or you place no restrictions and sanitize. – Max Jan 08 '09 at 02:44
  • Upvoted just for the xkcd reference ;) – Guy Kogus Dec 18 '18 at 12:37
4

I'm making software for driving schools in the USA, so what matters most to me is what the state DMVs accept as a proper name on a driver's license. In my case, it would cause problems to allow names beyond what the DMV allows, even if such names were legal, because the same name must later be used for a driver's license.

From StackOverflow, I still hadn't confirmed the answer I needed. And I happen to know that in my state (California) they're using AS/400s with software probably written in COBOL, and to the best of my knowledge, those only support an 8-bit character set. (Is it EBCDIC?) Anyway... Ugh.

So, I called the California DMV... Sure enough, their system allows A-Z and spaces and absolutely nothing else. Not even hyphens are allowed -- Hyphens are replaced with spaces. In fact, apparently just to be difficult, they only use capitals. And names such as "O'Malley" must be replaced with OMALLEY.
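Those rules are simple enough to sketch in Python. The function name `dmv_style` is made up, and dropping every character other than A-Z and space is my guess at the behavior; the DMV did not tell me what happens to accented letters, so treat that part as an assumption.

```python
import re


def dmv_style(name: str) -> str:
    """Approximate the rules described above: capitals and spaces only."""
    upper = name.upper().replace("-", " ")  # hyphens become spaces
    kept = re.sub(r"[^A-Z ]", "", upper)    # assumption: everything else is dropped
    return " ".join(kept.split())           # collapse runs of spaces
```

For example, `dmv_style("O'Malley")` gives `OMALLEY` and `dmv_style("Smith-Jones")` gives `SMITH JONES`.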

Leave it to government. I must say I'm thrilled not to be a developer working for DMV. (Although I could really use that kind of salary.)

PaulOTron2000
4

On the issue of name fields, the WRONG answer is first name, middle initial, last name, etc. for many reasons.

  1. Many people are known by their middle name, and formally use a first initial, middle name, last name format.

  2. In some cultures, the surname is the first name, and the given name is the last name.

  3. Multiple first and/or middle given names are becoming more common. As @Dour High Arch points out, the other extreme is people with only one word in their name.

In an object-oriented database, you would store a Name object with methods to return a directory-style or signature-style name; and the backing store would contain whatever data was necessary to support those methods.

I haven't yet seen a relational database model that improves on the model of two variable-length strings for directory-style and signature-style names.
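A minimal sketch of that two-string model, in Python. The class and field names are my own, and the sample people are purely illustrative. The point is that nothing is parsed or inferred: both forms are stored as entered, and sorting uses the directory-style form.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Name:
    directory_style: str  # the form used for sorting and indexing
    signature_style: str  # the form used when addressing the person

    def sort_key(self) -> str:
        """Case-insensitive key built from the directory-style form."""
        return self.directory_style.casefold()


people = [
    Name("Prince", "Prince"),  # a single-word name needs no special case
    Name("Homer, Paul W", "Paul W Homer"),
    Name("Borgwardt, Michael", "Michael Borgwardt"),
]
ordered = [n.signature_style for n in sorted(people, key=Name.sort_key)]
```

In a relational table this is just two variable-length text columns, one per field.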

Ken Paul
2

It really depends on what the app is supposed to be used for.

Sure, in theory it's great if you allow every script on god's green earth to be used, but if the DB is also used by support staff, are they going to be able to handle names in Japanese, Hebrew and Thai script? Can your printer, if it's used to print postage labels?

You might add an extra field "Latin Transcription", but IMO it's really OK to restrict it to ISO-8859-1 characters - People who don't use Latin characters are by now so used to having to use a transcription that they don't mind it anymore, unless they're hardcore nationalists.
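If you go this way, the check itself is small. A sketch in Python (the `fits_latin1` helper is hypothetical): a value fits if it can be encoded as ISO-8859-1, which could be used either to validate the name directly or to decide when the extra transcription field must be filled in.

```python
def fits_latin1(text: str) -> bool:
    """True if every character of text exists in ISO-8859-1."""
    try:
        text.encode("iso-8859-1")
    except UnicodeEncodeError:
        return False
    return True
```

Note that ISO-8859-1 does not even cover every Latin-script name: `fits_latin1("Håkon")` is true, but `fits_latin1("Łukasz")` is false, because Ł is not part of that character set.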

Michael Borgwardt
0

UTF-8 should be good enough. As for name fields, you'll want at minimum a first name and a last name.

chills42
0

Depending on the complexity of your name structure I could see:

  1. First Name
  2. Middle Initial/Middle Name
  3. Last Name
  4. Suffix (Jr. Sr. II, III, IV, etc.)
  5. Prefix (Mr., Mrs., Ms., etc.)
TheTXI
  • That's just for American names. – Svante Jan 07 '09 at 20:35
  • This isn't really the question that was asked, but if you're trying to answer "How many fields are needed" then I would argue that the answer is one. How many fields are appropriate to your application's purpose is a whole other question. – Max Jan 08 '09 at 02:45
0

What do you do when you have "The Artist Formerly Known as Prince"? That symbol he used is not a character in the Unicode set (AFAIK).

It's some levity, but at the same time, names are a rather broad concept that doesn't lend itself well to a structured format. In this case, something free-form might be most appropriate.

casperOne
  • Edge case. Ignore. The world will survive 0.000005% of people having hard-to-digitize names. – willc2 Feb 17 '09 at 04:35
  • @willc2: Not at all. What about names in Japanese, Chinese, Arabic, etc, etc, etc. Those are certainly not edge cases. – casperOne Feb 17 '09 at 15:32