I have an application that will store and track visitors. These visitors are created in the system by schedulers (users) as needed when they set up a visit. The problem is that, most of the time, the only identifying attributes of a visitor are as follows:

  • First Name
  • Last Name
  • Company Name

The risk of duplicate records existing for the same person is inherent: a scheduler may enter a new visitor record in lieu of searching the system for an existing visitor by that name.

When I encounter somebody entering a visitor by the same name I display a warning dialog with various suggestions of who this person COULD be, but then even that is not good enough.

I could enter 'Jim Jones' and this person may exist in the system as 'James Jones' or 'Jimmy Jones'. I see there are name recognition software packages available, but they are expensive and certainly heavier than what I am looking for.

Would anybody know where to find a free or open source dictionary file that I can programmatically access to find potential name variants? Software or an online service would be nice, but even just a data dump or simple text file might do.

I know even this will not prevent duplicate visitor records; I am just trying to keep them to a minimum, so this is not a critical feature.

maple_shaft
  • To clarify the design description above: when I say a scheduler may enter a new visitor record in lieu of searching the system, I mean that behaviour is by design. The user base is assumed to have minimal computer skills, so a clean, simple, hand-holding flow is necessary. – maple_shaft May 06 '11 at 12:47

1 Answer

Check out the Moby project (http://icon.shef.ac.uk/Moby/mwords.html) for lists of common first and last names. You can precompute groups of similar names using phonetic algorithms such as Metaphone and Soundex, and use those groups to identify potential matches. You also mention company names, which are a bit harder to manage since they can be made up of almost anything. For those, maybe check out the 12-dicts word list (http://wordlist.sourceforge.net/): the 2+2lemma list provided in that package groups multiple forms that share a common root, which can be used in conjunction with a similar-spelling solution to improve results.
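
As a rough illustration of the precomputation idea, here is a minimal sketch in Python. It assumes a hand-written American Soundex (the standard library does not ship one) and a hypothetical in-memory list of first names; `soundex`, `build_index` and `similar_names` are names made up for this example.

```python
from collections import defaultdict

# Letter-to-digit table for American Soundex.
_CODES = {}
for _digit, _letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                         "4": "l", "5": "mn", "6": "r"}.items():
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name: str) -> str:
    """Return the 4-character Soundex code of name ('' if it has no letters)."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            encoded += code
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (encoded + "000")[:4]


def build_index(names):
    """Precompute a mapping from Soundex code to the names that share it."""
    index = defaultdict(set)
    for name in names:
        index[soundex(name)].add(name)
    return index


def similar_names(index, name):
    """Look up the other names that share a Soundex code with name."""
    return index.get(soundex(name), set()) - {name}


# Hypothetical sample data standing in for a real name list.
index = build_index(["Smith", "Smyth", "Jim", "Jimmy", "James"])
print(similar_names(index, "Smith"))
print(similar_names(index, "Jim"))
```

With this sample data, 'Jim' and 'Jimmy' share a code but 'James' does not, so a phonetic index on its own will not connect a nickname to its given name.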

  • Thanks for posting. I will check out those links and let you know how that works out. To clarify, I am not concerned about searching for companies. The Company field will not be a search field, but it is displayed to distinguish two visitors with the exact same name. – maple_shaft May 06 '11 at 12:53
  • Hmm... having trouble figuring out what to do with the files I unpacked when I downloaded the Moby dictionary. The readme is no help whatsoever. – maple_shaft May 06 '11 at 13:07
  • Well, the Moby dictionary is a start, but not quite what I am looking for. It has an impressive set of names, but I can't really do much without a list that maps names to their variants. The Metaphone and Soundex algorithms that I tested won't work either, because they only find names that SOUND similar, which is not what I am looking for. If my search term is 'William', it should be able to search for variants like 'Bill', 'Billy', 'Will', 'Willy', 'Willie', etc. With a list like that I can easily write a query to find all visitors IN the list of name variants. – maple_shaft May 06 '11 at 13:47
  • The 2+2lemma list actually does this, but only for words, not names. Sorry, I am working on a very similar project (at least this one aspect) myself and am using the lemma list for general-purpose matching, but I have not found any good list of name forms so far. – lostatredrock May 06 '11 at 18:46
  • Took a look at some of the other posts linked to the name-matching tag and ran across this http://deron.meranda.us/data/nicknames.txt not super expansive but better than nothing...going to load it into my translation data-set. – lostatredrock May 06 '11 at 18:54
  • NICE!!! I also found this in the form of a csv file http://code.google.com/p/nickname-and-diminutive-names-lookup/downloads/list. I was able to load these into an object and find matches. I will probably try to write a script that can use the links you provided to plug the holes in my csv then I should have a really nice dictionary going. Thanks for all the help! – maple_shaft May 06 '11 at 19:38
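
For anyone who wants to try the approach described in the comments, here is a minimal sketch of loading a nickname CSV into a lookup object and querying visitors IN the list of variants. It is only an illustration under several assumptions: each CSV row is assumed to hold a given name followed by its variants (the real files linked above may be laid out differently), the sample rows are made up, and the `visitor` table with its columns is hypothetical.

```python
import csv
import io
import sqlite3
from collections import defaultdict

# Hypothetical sample rows: a given name followed by its variants.
SAMPLE_CSV = """\
william,bill,billy,will,willy,willie
james,jim,jimmy,jamie
"""


def load_variants(csv_file):
    """Map every name in a row to the full set of names in that row."""
    variants = defaultdict(set)
    for row in csv.reader(csv_file):
        group = {cell.strip().lower() for cell in row if cell.strip()}
        for name in group:
            variants[name] |= group
    return variants


def find_visitors(conn, variants, first_name, last_name):
    """Return visitors whose first name is any known variant of first_name."""
    key = first_name.strip().lower()
    candidates = sorted(variants.get(key, set()) | {key})
    placeholders = ",".join("?" for _ in candidates)
    sql = (
        "SELECT first_name, last_name, company FROM visitor "
        f"WHERE lower(last_name) = ? AND lower(first_name) IN ({placeholders}) "
        "ORDER BY first_name"
    )
    return conn.execute(sql, [last_name.strip().lower(), *candidates]).fetchall()


variants = load_variants(io.StringIO(SAMPLE_CSV))

# Hypothetical visitor table, held in memory for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visitor (first_name TEXT, last_name TEXT, company TEXT)")
conn.executemany("INSERT INTO visitor VALUES (?, ?, ?)", [
    ("James", "Jones", "Acme"),
    ("Jimmy", "Jones", "Initech"),
    ("Bill", "Jones", "Acme"),
])

matches = find_visitors(conn, variants, "Jim", "Jones")
print(matches)
```

Entering 'Jim Jones' here returns the existing 'James Jones' and 'Jimmy Jones' records but not 'Bill Jones'. Names that appear in more than one row pick up the variants of every row they appear in, which may or may not be what you want.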