I'm writing a parser that converts messy author strings into neatly formatted strings in the following format: ^([A-Z]\. )+[[:surname:]]$. Some examples below:
- Smith JS => J. S. Smith
- John Smith => J. Smith
- John S Smith => J. S. Smith
- J S Smith => J. S. Smith
I've managed to get quite far using various regular expressions to cover most of these, but have hit a wall for instances where a full name is provided in an unknown order. For example:
- Smith John
- John Smith
- Smith John Stone
Obviously regular expressions won't be able to discern what order the forename, surname and middle name(s) are in, so my thought is to perform a lexical analysis on the author string, returning a type and confidence score for each token. Has anyone coded such a solution before, preferably in Perl? If so, I imagine my code would look something like this:
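    use strict;
    use warnings;
    # Placeholder for the module I'm hoping exists
    use UnknownModule::NamePredictor qw( predict_name );

    my $messy_author = "Smith John Stone";
    my @names = split(' ', $messy_author);
    for my $name (@names) {
        # $type: forename, surname, etc.; $confidence: score for that guess
        my ($type, $confidence) = predict_name($name);
    }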
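To make the idea more concrete, here is a minimal sketch of what such a predict_name could do internally, assuming nothing more than per-token frequency counts from a labelled corpus. The two hashes and their numbers are invented for illustration, and the "share of labelled occurrences" confidence is just one simple choice, not an existing module's behaviour.

```perl
use strict;
use warnings;

# Hypothetical counts: how often each token was seen as a forename or as a
# surname in some labelled corpus. The numbers are made up for illustration.
my %forename_count = ( john => 900, smith => 5,   stone => 20 );
my %surname_count  = ( john => 30,  smith => 950, stone => 400 );

# Return (type, confidence) for a single token. Confidence is the share of
# labelled occurrences that agree with the chosen type.
sub predict_name {
    my ($token) = @_;
    my $key = lc $token;
    my $f   = $forename_count{$key} // 0;
    my $s   = $surname_count{$key}  // 0;

    # Token never seen in the corpus: no basis for a guess
    return ( 'unknown', 0 ) if $f + $s == 0;

    return $f >= $s
        ? ( 'forename', $f / ( $f + $s ) )
        : ( 'surname',  $s / ( $f + $s ) );
}

for my $name ( split ' ', 'Smith John Stone' ) {
    my ( $type, $confidence ) = predict_name($name);
    printf "%-6s %-8s %.2f\n", $name, $type, $confidence;
}
```

With these made-up counts both "Smith" and "Stone" come out as surnames, so a per-string step would still have to pick a single surname, for example the token with the highest surname confidence.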
I've seen a post here explaining the problem I have, but no viable solution has been suggested. Frankly, I'd be quite surprised if no one has coded such a solution before, as there are huge training sets available. I may go down this route myself if it hasn't been done already.
Other things to consider:
- I don't need this to be perfect. I'm looking for precision >90% ideally.
- I have >100,000 messy author strings to play with. My goal is to pass as many cleanly as possible, evaluate and improve the approach over time.
- These are definitely author strings, but they're muddled together in lots of different formats, hence the challenge I've set myself.
- For everyone pointing out that names can't always be categorised: yes, of course there will be such instances, which is why I'm only aiming for imperfect precision. However, the majority can be categorised pretty comfortably. I know this simply because my human brain, with all its clever pattern-recognising abilities, allows me to do it pretty well.
UPDATE: In the absence of an existing solution, I've been looking at training a Support Vector Machine model using LIBSVM. I should be able to build large and accurate training and test datasets using forenames and surnames taken from PubMed, which has a nice library of >25M articles containing categorised names. Unfortunately these don't have middle names though, just initials.
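As a rough sketch of the data preparation, the snippet below turns a labelled token into a line in LIBSVM's sparse input format ("<label> <index>:<value> ..." with ascending indices, zero values omitted). The four features, the suffix list and the +1 = forename / -1 = surname labelling are all assumptions of mine for illustration; a usable model would probably need richer features such as character n-grams.

```perl
use strict;
use warnings;

# Hypothetical, hand-picked features keyed by their 1-based LIBSVM index.
sub features {
    my ($token) = @_;
    my $t      = lc $token;
    my $len    = length $t;
    my $vowels = () = $t =~ /[aeiouy]/g;    # count of vowel characters

    return (
        1 => $len,                                      # token length
        2 => ( $len ? $vowels / $len : 0 ),             # vowel ratio
        3 => ( $t =~ /[aeiouy]$/ ? 1 : 0 ),             # ends in a vowel
        4 => ( $t =~ /(?:son|sen|ez|ski)$/ ? 1 : 0 ),   # surname-like suffix
    );
}

# Build one LIBSVM line. Label convention (my own choice):
# 1 = forename, -1 = surname.
sub libsvm_line {
    my ( $label, $token ) = @_;
    my %f = features($token);

    # Indices must be ascending; zero-valued features are left out.
    my @pairs = map { sprintf '%d:%g', $_, $f{$_} }
        grep { $f{$_} != 0 }
        sort { $a <=> $b } keys %f;

    return join ' ', $label, @pairs;
}

print libsvm_line( 1,  'John' ),    "\n";
print libsvm_line( -1, 'Johnson' ), "\n";
```

Lines like these, one per labelled PubMed name, could be written to separate training and test files for LIBSVM's command-line tools.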