
I'm writing a parser that converts messy author strings into neatly formatted strings in the following format: ^([A-Z]\. )+[[:surname:]]$ (where [[:surname:]] is just my shorthand for the surname, not a real POSIX class). Some examples below:

  • Smith JS => J. S. Smith
  • John Smith => J. Smith
  • John S Smith => J. S. Smith
  • J S Smith => J. S. Smith

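The unambiguous formats above can be handled with plain regexes; here's a simplified sketch of that part (the rules are illustrative, not my exact code):

```perl
use strict;
use warnings;

# Simplified sketch of the regex-based normalisation for unambiguous
# inputs. Anything it doesn't recognise is returned unchanged.
sub normalise {
    my ($author) = @_;

    # "Smith JS" -> "J. S. Smith" (surname first, run of initials last)
    if ( $author =~ /^([A-Z][a-z]+) ([A-Z]+)$/ ) {
        my ( $surname, $initials ) = ( $1, $2 );
        return join( ' ', map {"$_."} split //, $initials ) . " $surname";
    }

    # "J S Smith" -> "J. S. Smith" (initials first, surname last)
    if ( $author =~ /^((?:[A-Z] )+)([A-Z][a-z]+)$/ ) {
        my ( $initials, $surname ) = ( $1, $2 );
        $initials =~ s/([A-Z]) /$1. /g;
        return $initials . $surname;
    }

    return $author;    # ambiguous format: leave untouched for now
}

print normalise('Smith JS'),  "\n";    # J. S. Smith
print normalise('J S Smith'), "\n";    # J. S. Smith
```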
I've managed to get quite far using various regular expressions to cover most of these, but have hit a wall for instances where a full name is provided in an unknown order. For example:

  • Smith John
  • John Smith
  • Smith John Stone

Obviously regular expressions won't be able to discern what order the forename, surname and middle name(s) are in, so my thought is to perform a lexical analysis on the author string, returning a type and confidence score for each token. Has anyone coded such a solution before, preferably in Perl? If so, I imagine my code would look something like this:

use strict;
use warnings;
use UnknownModule::NamePredictor qw( predict_name );    # hypothetical module

my $messy_author = "Smith John Stone";
my @names = split ' ', $messy_author;
for my $name (@names) {
    # each token would get a guessed type (forename/surname/middle)
    # plus a confidence score
    my ($type, $confidence) = predict_name($name);
    print "$name => $type ($confidence)\n";
}

I've seen a post here explaining the problem I have, but no viable solution has been suggested. I'd be quite surprised if no one has coded such a solution before if I'm honest, as there are huge training sets available. I may go down this route myself if it hasn't been done already.


Other things to consider:

  • I don't need this to be perfect. I'm looking for precision >90% ideally.
  • I have >100,000 messy author strings to play with. My goal is to pass as many cleanly as possible, evaluate and improve the approach over time.
  • These are definitely author strings, but they're muddled together in lots of different formats, hence the challenge I've set myself.
  • For everyone pointing out that names can't always be categorised: yes, of course there will be such instances, which is why I'm aiming for high-but-imperfect precision. However, the majority can be categorised pretty comfortably. I know this simply because my human brain, with all its clever pattern-recognising abilities, lets me do it pretty well.

UPDATE: In the absence of an existing solution I've been looking at creating a model with a Support Vector Machine, using LIBSVM. I should be able to build large, accurate training and test datasets using forenames and surnames taken from PubMed, which has a nice library of >25M articles containing categorised names. Unfortunately these don't include middle names, just initials.
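To make that concrete, here's roughly how I'm planning to turn a categorised name token into a LIBSVM-format training row. The feature set here is a first guess and will certainly change; the sub name and label scheme are mine, not anything LIBSVM prescribes:

```perl
use strict;
use warnings;

# Sketch: emit one LIBSVM-format line ("label index:value ...") per token.
# Label 1 = forename, -1 = surname. Candidate features (a first guess):
#   1: token length
#   2: token ends in a vowel (0/1)
#   3: position of the token within the author string (0-based)
#   4: total number of tokens in the author string
sub libsvm_line {
    my ( $label, $token, $position, $n_tokens ) = @_;
    my @features = (
        length($token),
        ( $token =~ /[aeiouy]$/i ? 1 : 0 ),
        $position,
        $n_tokens,
    );
    my $i = 0;
    return join ' ', $label, map { ++$i . ':' . $_ } @features;
}

# e.g. "John" as the first of three tokens, known to be a forename:
print libsvm_line( 1, 'John', 0, 3 ), "\n";    # 1 1:4 2:0 3:0 4:3
```

Each PubMed author record would then contribute one such line per token, and the resulting file goes straight to svm-train.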

  • This is off-topic for Stack Overflow, and will probably be closed as such: *Questions asking us to recommend or find a book, tool, software library, tutorial or other off-site resource are off-topic for Stack Overflow as they tend to attract opinionated answers and spam. Instead, describe the problem and what has been done so far to solve it.* I suggest that you read [What topics can I ask about here?](http://stackoverflow.com/help/on-topic) – Borodin Dec 30 '15 at 12:52
  • Voting against closing as a request for book/tool/module recommendation. This is more "how do I lex" than "what's the best tool for lexing"? – ikegami Dec 30 '15 at 12:59
  • @Borodin How is this off-topic when I've described the problem and what's been done to solve it? That just smacks of trolling to me. – D2J2 Dec 30 '15 at 13:09
  • The way you formulated your question is what led to Borodin voting to close. You are saying _Does anyone know such a tool_, which is a red flag for off-topic and we are quite pedantic on this kind of thing. Your question is already quite good. You should rephrase that part and take into consideration that ikegami pointed out _lexing_. A _lexer_ is what you need. Alternatively something to build a grammar. Check on that, and come back with a more concrete question. Feel free to [edit] this one to make it on-topic. Also don't limit yourself to names, think a bit more generic, that will help. – simbabque Dec 30 '15 at 14:36
  • I've edited, hopefully that clears it up a bit. Yes, it is a lexical analysis I'm after. However, I don't see why making it more generic would help as the author strings are nearly always names of a sort. My progress towards a solution has so far led me to creating a training set of author strings each categorised by their type of name (token), position of name (token) and number of names (tokens) in the string. This could not be extrapolated into something more generic as the categorisation is entirely dependent on the name itself. Do you agree? – D2J2 Dec 30 '15 at 14:54
  • You might want to read [Falsehoods Programmers Believe About Names](http://www.kalzumeus.com/2010/06/17/falsehoods-programmers-believe-about-names/) – glenn jackman Dec 30 '15 at 15:50
  • Where does your original data come from? Since you're talking about authors, I assume they've actually published something; information about publications might be available in some database (e.g. the Library of Congress) in a more standardized format. – ThisSuitIsBlackNot Dec 30 '15 at 16:06
  • ***"That just smacks of trolling to me"*** Certainly not trolling, but I have been known to be wrong! You have nothing to worry about unless four others agree with me, and I see that you have two votes already. Many people make the mistake of assuming that SO is a forum like all the others on the net. But it is more like a *knowledge base*, like Wikipedia, and by posting a question you are implicitly suggesting that your problem is a suitable topic for the site – Borodin Dec 30 '15 at 16:17
  • @Borodin. No problem, just a little frustrating after you spend ages writing the thing only to discover it may get binned over something (in this case) irrelevant. Sorry for calling you a troll! – D2J2 Dec 30 '15 at 16:20
  • @D2J2: No worries, I understand completely. I've edited my comment and you may want to take another look – Borodin Dec 30 '15 at 16:21
  • @ThisSuitIsBlackNot (funny username!). Unfortunately there's no nice format these can be traced back to in another database. They're just a muddled list of about 99.99% author strings. – D2J2 Dec 30 '15 at 16:23
  • @glenn That link is semi-useful, but still doesn't help my case. I understand these assumptions, but you have to realise this is coming from a slightly different angle. I've updated the post to reflect this. Remember, the goal is to get as high precision as possible - which will probably not be 100%. – D2J2 Dec 30 '15 at 16:32
  • @D2J2 In that case, the approach you describe in your update sounds spot-on. I doubt something quite that sophisticated already exists, but the Lingua namespace contains a lot of useful modules for working with natural language processing. See [Lingua::Names](https://metacpan.org/pod/Lingua::Names), for example, which lets you compare a string against first names from the U.S. census (note that the module is still in alpha). That alone would be too simplistic...you need some heuristics because some names can be first *or* last names, etc., but it's a start. – ThisSuitIsBlackNot Dec 30 '15 at 16:51
  • As for your question here, it's a great question, but I think it's a bit too broad for Stack Overflow. Any answers would have to be quite long to provide a decent solution. I also disagree that your problem is related to lexing. There's no way for a lexer to distinguish between "Smith John" and "John Smith"; the best it can say is that you have two tokens. – ThisSuitIsBlackNot Dec 30 '15 at 17:05
  • Oh please don't do this. So many places mangle names by trying this sort of thing. Not only is it stupid, but it often prevents proper searching because the name turns out to be something that it shouldn't be. Names are an extremely complicated thing. – brian d foy Dec 30 '15 at 17:53
  • @ThisSuitIsBlackNot: I think ikegami's point is valid. It's the *parser's* job to deliver tokens and the *lexer's* job to make sense of them – Borodin Dec 30 '15 at 21:17
  • @D2J2: As Brian has intimated, this is a *bad idea*. You are trying to automate the correction of malformed or ambiguous data entry, and the result would be data that *looks* reliable but may well not be. Data that looks like `Smith John` needs *verifying*, and Mr John will be very unhappy if messages intended for him are sent elsewhere. The best place for software like this is at the data entry point, and even then a *did you mean?* option is too easy to choose. Data must be verified *at the point of entry* and presumed to be valid thereafter – Borodin Dec 30 '15 at 21:25
  • @Borodin I think you've got that backwards...generally lexers tokenize input and parsers assign meaning to the tokens based on the grammar. But that's neither here nor there. My point is that the OP's problem has nothing to do with lexing or parsing: the only way to distinguish between "Smith John" and "John Smith" is by consulting a database of common names, which I think would fall outside any reasonable definition of parsing. – ThisSuitIsBlackNot Dec 30 '15 at 21:38
  • @ThisSuitIsBlackNot: Thanks, you're right, and I apologise. Lexical analysis or *lexing* is the process of recognising one or more characters as *tokens* in a string. A *parser* has the harder job of making sense of that sequence of tokens. So I withdraw my support of ikegami's comment! – Borodin Dec 30 '15 at 22:06
  • @D2J2: Now that your question has elicited a huge number of comments and three votes to close, but no solution, I hope you understand my initial vote? – Borodin Dec 30 '15 at 22:09
  • I think a lot of you are still missing the point. Yes, it won't always work when trying to predict which name is a surname and which is a forename. Though, as mentioned in my post, it doesn't need to be perfect. Many of you are assuming I'm changing something that a user intended. This is not true in this case. The problem's arisen from trying to convert a mixture of author formats into one format. I'm actually quite close to a solution now, and will post my answer. Hopefully it'll help someone. – D2J2 Dec 31 '15 at 13:00
  • I'll also take note of the fact that this question was quite broad and keep future ones more succinct. Thanks for your thoughts, suggestions etc. It's been useful not just for answering my question, but also as a crash course in the nuances of Stack Overflow. – D2J2 Dec 31 '15 at 13:02
  • @Borodin yes, very funny! – D2J2 Dec 31 '15 at 13:05

0 Answers