Parsing a full name into its constituents

Question

We need to develop a back-end application that can parse a full name into:

Prefix (Dr., Mr., Ms., etc.)
First Name
Last Name
Middle Name
etc.

The challenge is that it has to support names from multiple countries and languages. One assumption we can make is that we will always receive a country and language along with the full name as input.

The full name may come in any format. For the same country/language combination, it may arrive as first name followed by last name, or the reverse. A comma will not be part of the full name.

Is it feasible? We are also open to commercially available software.

It is impossible to solve in general. Sometimes first name and last name are indistinguishable, for example John Kurt. Both are possible first names. — Andrey, Apr 08 '11 at 14:44
I have a friend with three separate words in his last name. Not hyphenated. Good luck. — Terry Wilcox, Apr 08 '11 at 14:45
If you have to handle Arabic or Indian names, this is going to get even more intangible. I'd suggest you ask your customers to enter their First, Last, Family etc. names separately and simply go by what they say. I don't think it's possible to parse a name into these components. — Noufal Ibrahim, Apr 08 '11 at 14:47
Barney Frank is a politician, Frank Barney is an artist. How could you guess what format the full name is? — Terry Wilcox, Apr 08 '11 at 14:49
The fact alone that you use the expressions "first name" and "last name" suggests that you're underestimating the problems. In many Asian countries, the family name is put first - and that's the least of the potential complications. — Michael Borgwardt, Apr 08 '11 at 14:55
Forgot about this. Unless you receive this as separated values, it can't be done. And I really doubt it is needed. If you have such a requirement, that probably means that somebody didn't clearly understand what he/she really wants. — Paweł Dyda, Apr 08 '11 at 16:01
Country and language is not enough, as people are known to move to other countries. Or the countries change their borders, while the people remain in place. — Bo Persson, Apr 08 '11 at 16:49
Thanks for all the useful suggestions offered here. Will definitely look to avoid this in the first place. — prabhu, Apr 08 '11 at 18:25
Isn't the underlying problem a linguistic one? Maybe it would be a good question for http://linguistics.stackexchange.com/ (the field of linguistics dealing with names of people is called "Anthroponymy"). — Rinke, Nov 17 '14 at 13:06
Look at [my answer](http://stackoverflow.com/a/39867424/4733655) to a similar question. — AmirHossein Manian, Oct 05 '16 at 09:33

score 8 · Answer 1 · answered Apr 08 '11 at 14:44

I think this is impossible. Consider Ralph Vaughan Williams. His family name is "Vaughan Williams" and his first name is "Ralph". Contrast this with Charles Villiers Stanford, whose family name is "Stanford", with first name "Charles" and middle name "Villiers".

Both are English-speaking composers from England, so country and language information is not sufficient to establish the correct parsing logic.

score 8 · Answer 2 · edited Jul 18 '11 at 19:45

Since the OP was open to any commercially available offering...

"IBM InfoSphere Global Name Analytics" appears to be a commercial product that does what was originally asked: parsing a free-form, unstructured personal name (full name). It apparently also resolves, with some degree of certainty, some of the name ambiguities described in other responses.
Note: I have no personal experience with the product and no association with it. I merely came across this discussion and the reference links below while re-investigating essentially the same problem as the OP. HTH.

A general product documentation link:
http://publib.boulder.ibm.com/infocenter/gnrgna/v4r1m0/topic/com.ibm.gnr.gna.ic.doc/topics/gnr_gna_con_gnaoverview.html

Refer to the "Parsing names using NameParser" at
http://publib.boulder.ibm.com/infocenter/gnrgna/v4r1m0/topic/com.ibm.gnr.gna.ic.doc/topics/gnr_np_con_parsingnamesusingnameparser.html

The NameParser is a component API for the product per
http://publib.boulder.ibm.com/infocenter/gnrgna/v4r1m0/topic/com.ibm.gnr.gna.ic.doc/topics/gnr_gnm_con_logicalarchitecturecapis.html

Refer to the "Parsing names using IBM NameWorks" at
http://publib.boulder.ibm.com/infocenter/gnrgna/v4r1m0/topic/com.ibm.gnr.gna.ic.doc/topics/gnr_gnm_con_parsingnamesusingnameworks.html

"IBM NameWorks combines the individual IBM InfoSphere Global Name Recognition components into a single, unified, easy-to-use application programming interface (API), and also extends this functionality to Java applications and as a Web service"

http://publib.boulder.ibm.com/infocenter/gnrgna/v4r1m0/topic/com.ibm.gnr.gna.ic.doc/topics/gnr_gnm_con_logicalarchitecturenwapis.html

To clarify why I think this answers the question and eases some of the difficulties mentioned in other responses: if I understood what I read correctly, the APIs use the "NameHunter Server" to search the "IBM InfoSphere Global Name Data Archive (NDA)", which is described as "a collection of nearly one billion names from around the world, along with gender and country of association for each name. This large repository of name information powers the algorithms and rules that IBM InfoSphere Global Name Recognition products use to categorize, classify, parse, genderize, and match names."

FWIW, I also ran across a "Name Parser" that uses a database of ~140K names, as noted at:
http://www.melissadata.com/dqt/websmart-web-services.htm

score 2 · Answer 3 · answered Nov 30 '13 at 17:03

Here are two free PHP name parsing libraries for those on a budget:

https://code.google.com/p/php-name-parser/

http://jasonpriem.org/human-name-parse/

And here is a JavaScript library on npm:

https://npmjs.org/package/name-parser

answered Nov 30 '13 at 17:03 by Kevin Borders, edited Nov 30 '13 at 23:16

score 2 · Answer 4 · answered May 26 '14 at 06:05

I wrote a simple human name parser in JavaScript as an npm module:

https://www.npmjs.org/package/humanparser

humanparser

Parse a human name string into salutation, first name, middle name, last name, suffix.

Install

npm install humanparser

Usage

var human = require('humanparser');

var fullName = 'Mr. William R. Jenkins, III'
    , attrs = human.parseName(fullName);

console.log(attrs);

//produces the following output

{ salutation: 'Mr.',
  firstName: 'William',
  suffix: 'III',
  lastName: 'Jenkins',
  middleName: 'R.',
  fullName: 'Mr. William R. Jenkins, III' }

score 2 · Answer 5 · answered May 27 '15 at 12:48

A basic algorithm could do the following:

First, check whether the incoming string starts with a title such as Mrs (comparing against a fixed list of titles) and remove it if it does.
If exactly one space remains, assume the first word is the first name and the second word is the surname (which will sometimes be incorrect).
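
The two steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the TITLES list and the parseSimpleName function are made-up names, the list is far from complete, and anything that is not exactly two words after title removal is returned unparsed rather than guessed.

```javascript
// Hypothetical, deliberately incomplete list of titles (lowercase, no dot).
const TITLES = ['mr', 'mrs', 'ms', 'miss', 'dr', 'prof'];

function parseSimpleName(fullName) {
  // Split on any run of whitespace and drop empty pieces.
  const words = fullName.trim().split(/\s+/).filter(Boolean);
  let title = null;

  // Step 1: remove a leading title if it matches the fixed list.
  if (words.length > 0) {
    const candidate = words[0].replace(/\.$/, '').toLowerCase();
    if (TITLES.includes(candidate)) {
      title = words.shift();
    }
  }

  // Step 2: exactly two words left, so guess "first name, surname".
  // This guess will be wrong for names written surname-first.
  if (words.length === 2) {
    return { title, firstName: words[0], lastName: words[1] };
  }

  // Anything else is left unparsed instead of being guessed.
  return { title, firstName: null, lastName: null, unparsed: words.join(' ') };
}

console.log(parseSimpleName('Mrs Jane Doe'));
console.log(parseSimpleName('Ralph Vaughan Williams'));
```

Returning the remainder unparsed keeps the cases the algorithm cannot handle visible, so they can be routed to a human or a better parser.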

Going beyond that would be a lot of work. See "How to parse full names" to identify avenues for improvement, and see these involved IBM docs for further implementation clues.

score 2 · Answer 6 · answered Apr 08 '11 at 15:07

The only reasonable approach is to avoid having to do so in the first place. The most obvious (and common) way to do that is to have the user enter the title, first/given name, last/family name, suffix, etc., separately from each other, rather than attempting to parse them out of a single string.

answered Apr 08 '11 at 15:07 by Jerry Coffin

Even this is problematic, as not all people have a "last name" or "family name", and depending on the jurisdiction, sometimes they are legally treated instead as if they have no first name... – R.. GitHub STOP HELPING ICE Apr 15 '11 at 03:24

@R.. +1. There was an Indonesian professor at my university who used one name. Since both "First Name" and "Last Name" were required fields in the database that drove the course scheduling process, he was given the first name "Mr". I guess the "Title" field was either missing or optional. – phoog Jan 09 '13 at 04:29

score 2 · Answer 7 · answered Apr 08 '11 at 15:41

Ask yourself: do you really need the different parts of a name? Parsing names is inherently un-doable, since different cultures use different conventions (e.g. "middle name" is a typical USA-ism) and some small percentage of names will always be treated wrongly.

It is much preferable to treat a name as an "atomic", unsplittable entity.

answered Apr 08 '11 at 15:41 by mfx

+1, but then you run into ridiculous cultural conventions like collating by last name.... – R.. GitHub STOP HELPING ICE Apr 15 '11 at 03:25

score 1 · Answer 8 · answered Apr 08 '11 at 14:50

Given "Ashton Jordan" and "Jordan Ashton", you can't tell which is the surname and which is the given name. Also, people in South India apparently don't have a surname. The same goes for Sherpas in the Himalayas.

But say you have a huge list of all surnames (ones that are never used as given names). Then maybe you can use that to identify the other parts of the name (Salutations/Given/Middle/Jr/Sr/I/II/...). And if there is ambiguity, your name parser could ask for human input.
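
A minimal sketch of that idea, assuming two small made-up lookup sets (KNOWN_SURNAMES and KNOWN_GIVEN_NAMES are hypothetical; real data would have to be much larger and country-specific). A word only counts as an anchor if it is a known surname that is never a given name; otherwise the result is flagged for human review.

```javascript
// Hypothetical reference data; many real names appear in both sets.
const KNOWN_SURNAMES = new Set(['miller', 'stanford', 'jordan', 'ashton']);
const KNOWN_GIVEN_NAMES = new Set(['john', 'charles', 'jordan', 'ashton']);

function classifyWords(fullName) {
  const words = fullName.trim().split(/\s+/).filter(Boolean);

  // Keep only words that are surnames and never given names.
  const surnameOnly = words.filter((w) => {
    const key = w.toLowerCase();
    return KNOWN_SURNAMES.has(key) && !KNOWN_GIVEN_NAMES.has(key);
  });

  // Exactly one unambiguous surname: use it to label the rest.
  if (surnameOnly.length === 1) {
    const surname = surnameOnly[0];
    return {
      surname,
      others: words.filter((w) => w !== surname),
      needsHumanReview: false,
    };
  }

  // Zero or several candidates: do not guess, ask a human.
  return { surname: null, others: words, needsHumanReview: true };
}

console.log(classifyWords('John Miller'));
console.log(classifyWords('Ashton Jordan'));
```

With these sets, 'John Miller' resolves to the surname 'Miller', while 'Ashton Jordan' is flagged because both words can be either kind of name.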

score 1 · Answer 9 · answered Apr 15 '11 at 03:30

As others have explained, the problem is not solvable. The best approach I can think of for storing names is to store the full name, followed by the start (and potentially also the end) offset of a "primary collating subfield", which the person entering the name could indicate by highlighting it or similar. For example:

John Robert **Miller**, Jr.

where the boldface indicates what was marked as the "primary collating subfield". This range would then be moved to the beginning of the string when generating the collating key.
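
As a rough sketch of how such a key could be generated: the record shape (fullName, collateStart, collateEnd) and the collatingKey function are assumptions made for illustration, and the example assumes "Miller" is the part the user marked.

```javascript
// Assumed record shape: the full name plus [start, end) offsets of the
// part the user marked as the primary collating subfield.
function collatingKey(record) {
  const { fullName, collateStart, collateEnd } = record;
  const primary = fullName.slice(collateStart, collateEnd);

  // Everything outside the marked range, with whitespace tidied up.
  const rest = (fullName.slice(0, collateStart) + fullName.slice(collateEnd))
    .replace(/\s+/g, ' ')
    .trim();

  // The marked range moves to the front; the key is for sorting, not display.
  return rest ? `${primary} ${rest}` : primary;
}

const fullName = 'John Robert Miller, Jr.';
const start = fullName.indexOf('Miller'); // assumed to be the marked part
const record = {
  fullName,
  collateStart: start,
  collateEnd: start + 'Miller'.length,
};

console.log(collatingKey(record)); // Miller John Robert , Jr.
```

A list of such records could then be sorted with something like records.sort((a, b) => collatingKey(a).localeCompare(collatingKey(b))), while the stored full name stays untouched for display.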

Of course, this approach alone may not be sufficient if you also want to support titles (and ignore them for collation purposes)...

+1 .. According to https://www.w3.org/International/questions/qa-personal-names, Japanese names are collated based on pronunciation. So to take this approach a step further, you would store the actual collation text rather than a pair of offsets, since the collation text isn't always contained in the name. — Pancake, Sep 27 '18 at 02:17
