1

I was granted with the beautiful task ;-) to design some tables in a MySQL Database which should hold human names.

Criteria:

  1. I have only the full names. (There is no separation for e.g. prename, surname and so on)
  2. The storage should be diacritic sensitive. (The following names stand for different persons)

    • "Voss" and "Voß".
    • "Joel" and "Joël".
    • "franc" and "Franc" and "Fránc".
  3. A search should return all similar names to the search string: E.g: Search for "franc" should return ["franc", "Franc", "Fránc"] and so on... (It would be awesome if the search would return not only the diacritice insensitive matches but perhaps similar sounding names or names that match in parts to the search string, too...)

I thougt of using the COLLATION utf8_bin for the column (declared as unique) in which I will store the names. This would satisfy point 2. But this will hurt point three. Declaring the column name as unique with collation utf8_unicode_ci satisfys point 3. but it hurts point two.

So my question is: Is there a way to solve this task and respecting all criteria? And since I don't want to reinvent the wheel: Is there an elegant way to handle human names (and their searches) in databases? (Sadly, I do not have the possibility of splitting the names into prename, surnames and optional middlenames...)

Edit:

The amount of names is arount a million (~1.000.000) entrys. And if it matters: I am using python as scripting language to populate the database and query the data later on.

Aufwind
  • 25,310
  • 38
  • 109
  • 154
  • Can you specify how many records might be in the table? This would effect which solution I would suggest... – designosis Jul 29 '11 at 04:50
  • @neokio: I think around a million entrys. (Edited my question). – Aufwind Jul 29 '11 at 04:53
  • can you at least have a companion table containing component names (i.e. names decomposed as words)? I spent about a year working with name matching. You wouldn't believe what you run into: names with numbers in them, names with a symbol in them ( stuff like *e but pronounced Starry). Purposeful creative spelling that almost guarantees names are never entered the same by two different data entry people (like 30+ variations of name Kaylie, K-Lee, Cayleigh, and so on). It's crazy. – hatchet - done with SOverflow Jul 29 '11 at 05:12
  • @hatchet: Well I could split the full names by space. Then I would have decomposed words. Would this help? (If I got you right...) – Aufwind Jul 29 '11 at 05:22

2 Answers2

2

What is useful is if you can decompose the full name into component "name words" and store a phonetic encoding (metaphone or one of the many other choices) for each of them. You just need the notion of name words though, not specifically categorizing it as first or middle or last, which is fine because those categories don't work well across cultures anyway). But you can use positional order information later in ranking if you want so that searching for "Paul Carl" matches "Paul Karl" better than matching "Carl Paul". You need to be aware of ambiguous punctuation that may require storing multiple versions of some name words. For instance Bre-Anna Heim would be broken into the name words "bre" "anna" "breanna" and "heim". Sometimes the dash is irrelevant like Bre-Anna, but sometimes not like in Sally-June". Bre-Anna never uses just Bre or Anna, but Sally-June may just use Sally or just June sometimes. It's hard to know which, so cover both possibilities.

You can write your query against this by similarly decomposing and phonetically encoding the full name you're searching for. Your query can return, say, those full names that have two or more component name phonetic matches (or one if there is only one name in the search or the source). This gives you a subset of full names to consider further. You could come up with a simple ranking of them, or even do something like a distance matching algorithm on this subset, which would be too expensive computationally to do against the entire million names. When I say distance matching, I'm talking on-line algorithms like Levenshtein distance and the like.

(edit) The reasoning for this is handling cases like the following name: Maria de los Angeles Gomez-Rodriguez. One data entry person may just enter Maria Gomez. Another might enter Maria Gomez Rodriguez. Yet another might enter Maria Angeles Rodrigus.

1

You can use an algorithm like Metaphone (or Double Metaphone) in another column so that you can try to find names that are "similar" to each other. You will have to look for an international version that knows about the german esset character.

scwagner
  • 3,975
  • 21
  • 16