
I have a database of over a million contacts and need to return the best matches for (a) user queries and (b) batch jobs that run periodically. There is little debate that matching people's names is complex, and I am considering different routes:

  1. Roll our own (something basic to get us out of the blocks). There are lots of good threads on this topic, such as "How to calculate score for Metaphone/Soundex name searching in .net".
  2. Leverage Azure Search / Cognitive Skills: our platform is already built on Azure, and using Azure Search would potentially be less work than (1) and a smaller jump than (3).
  3. Look to third parties outside Azure that specialise in people name matching (NetOwl, Basistech, etc.).

Given that we are scoped to solving name matching for Western-style people names, can someone give me the pros and cons of using Azure Search to solve this? Here are some of the classes of issues I hope we can address:

  • Phonetic similarity: Jesus <=> Heyzeus
  • Transliteration spelling differences: Abdul Rasheed <=> Abd al-Rashid
  • Alternate names: William <=> Will <=> Bill <=> Billy
  • Missing spaces or hyphens: MaryEllen <=> Mary Ellen <=> Mary-Ellen
  • Truncated name components: McDonalds <=> McDonald <=> McD
  • Optional name tokens: Joaquín Archivaldo Guzmán Loera <=> Joaquín Guzmán
  • Name order variations: Park Sol Mi <=> Sol Mi Park
  • Initials: J. E. Smith <=> James Earl Smith

Thanks in advance for any guidance and help. Simon.

Simon

1 Answer


Interesting case! I believe there is no right or wrong answer here, and it will also depend on budget and time constraints. What is your primary data source? Are you using a source supported by the Azure Cognitive Search indexer, such as SQL or Cosmos DB? How are the contacts stored? Are first and last names in separate fields, or is everything in one field?

Since you are mostly looking for guidance around Azure Cognitive Search, I will describe how I would try to tackle this case with Azure Cognitive Search. Hopefully it will help you in deciding which technology suits your purpose the best.

I don't have experience with all of these cases, so please comment on this post if you have better suggestions and I will update it. There are a few similar topics that use different technology, but with the same Lucene query syntax and some of the same tokenizers.

Phonetic similarity: Jesus <=> Heyzeus

You could add a PhoneticTokenFilter to a custom analyzer and select the encoder that performs best for your specific case.
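As a rough sketch of what that could look like, the snippet below builds the analyzer-related part of an index definition as a Python dict in the REST JSON shape. The names `name_phonetic` and `name_phonetic_filter` are hypothetical, and the choice of `doubleMetaphone` with `replace` set to false (so the original token is kept next to its phonetic code) is an assumption to test against your own data, not a recommendation.

```python
import json


def phonetic_analyzer_definition(encoder: str = "doubleMetaphone") -> dict:
    """Return the analyzer section of an index definition (REST JSON shape).

    All names below are hypothetical placeholders.
    """
    return {
        "analyzers": [
            {
                "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
                "name": "name_phonetic",
                "tokenizer": "standard_v2",
                # Lowercase and fold accents before phonetic encoding.
                "tokenFilters": ["lowercase", "asciifolding", "name_phonetic_filter"],
            }
        ],
        "tokenFilters": [
            {
                "@odata.type": "#Microsoft.Azure.Search.PhoneticTokenFilter",
                "name": "name_phonetic_filter",
                "encoder": encoder,
                # False keeps the original token alongside the encoded one.
                "replace": False,
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(phonetic_analyzer_definition(), indent=2))
```

The resulting fragment would be merged into the index definition, and the analyzer assigned to whichever field holds the name. Swapping the `encoder` value is then a cheap way to compare encoders on a sample of real names.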

Transliteration spelling differences: Abdul Rasheed <=> Abd al-Rashid

Fuzzy search could be an option; however, the names in the example above are just too different.
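For reference, fuzzy search uses the full Lucene query syntax, where `term~N` allows an edit distance of N (at most 2) on a single term. A minimal sketch of building such a query string is below; the helper name `build_fuzzy_query` is made up for illustration, and splitting on every non-word character is an assumption about how you want hyphenated names handled.

```python
import re


def build_fuzzy_query(name: str, max_edits: int = 1) -> str:
    """Turn a name into a Lucene fuzzy query, one `term~N` per token."""
    if not 0 <= max_edits <= 2:
        raise ValueError("fuzzy edit distance must be 0, 1 or 2")
    # Split on anything that is not a letter, digit or underscore, so that
    # "al-Rashid" becomes two terms and no Lucene operators leak through.
    tokens = [t for t in re.split(r"\W+", name) if t]
    return " ".join(f"{t}~{max_edits}" for t in tokens)


if __name__ == "__main__":
    print(build_fuzzy_query("Abdul Rasheed", 2))
    print(build_fuzzy_query("Abd al-Rashid"))
```

The resulting string would be sent as the `search` parameter with `queryType=full`. Because the distance applies per term, it cannot bridge a difference in the number of tokens, which is one reason transliteration variants are hard to catch this way.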

Alternate names: William <=> Will <=> Bill <=> Billy

You could use SynonymMaps if you have this data.
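A synonym map is defined by a name, a format (`solr`) and a newline-separated list of rules, where a comma-separated line marks its terms as equivalent. The sketch below only builds that payload; the map name and the nickname list are illustrative assumptions, and you would still need a real nickname data set.

```python
from typing import Dict, Iterable, List


def build_synonym_map(name: str, groups: Iterable[List[str]]) -> Dict[str, str]:
    """Build a synonym map payload with one equivalence rule per group."""
    rules = [", ".join(group) for group in groups]
    return {"name": name, "format": "solr", "synonyms": "\n".join(rules)}


if __name__ == "__main__":
    # Hypothetical map name and nickname groups, for illustration only.
    payload = build_synonym_map(
        "given-name-nicknames",
        [["William", "Will", "Bill", "Billy"], ["Robert", "Rob", "Bob", "Bobby"]],
    )
    print(payload["synonyms"])
```

Once created, the map is attached to a field through that field's `synonymMaps` property, so the expansion happens at query time and the index does not need rebuilding when the list changes.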

Missing spaces or hyphens: MaryEllen <=> Mary Ellen <=> Mary-Ellen

You could possibly use a tokenizer that will remove whitespace and punctuation/symbols.
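To illustrate the effect you would be aiming for, here is a small standalone Python sketch that collapses a name into a single token by dropping whitespace and punctuation. It is not Azure Cognitive Search code, and `squash_name` is a made-up helper; inside the index, the equivalent normalisation would have to come from whichever tokenizer or char filter you configure.

```python
import re


def squash_name(name: str) -> str:
    """Remove whitespace, punctuation and symbols, then case-fold."""
    # [\W_]+ matches runs of anything that is not a letter or digit.
    return re.sub(r"[\W_]+", "", name).casefold()


if __name__ == "__main__":
    for variant in ("MaryEllen", "Mary Ellen", "Mary-Ellen"):
        print(variant, "->", squash_name(variant))
```

Indexing such a squashed form in a separate field, next to the normally tokenized name, would let a query match on either representation.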

Truncated name components: McDonalds <=> McDonald <=> McD

You could use SynonymMaps if you have this data. However, I think fuzzy search could do the job already.

Optional name tokens: Joaquín Archivaldo Guzmán Loera <=> Joaquín Guzmán

You could leverage proximity search.

Name order variations: Park Sol Mi <=> Sol Mi Park

This also depends on how the fields are stored, but I think proximity search could solve this case as well.
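In the full Lucene syntax, a proximity search is a quoted phrase followed by `~N`, where N is how far apart the terms are allowed to be. A minimal sketch of building one is below; `build_proximity_query` is a hypothetical helper and the default slop of 3 is an arbitrary starting value, not a tuned one.

```python
import re


def build_proximity_query(name: str, slop: int = 3) -> str:
    """Wrap the name's tokens in a Lucene proximity phrase: "a b c"~N."""
    if slop < 0:
        raise ValueError("slop must be non-negative")
    # Keep only word tokens so quotes or operators in the input cannot
    # break out of the phrase.
    tokens = re.findall(r"\w+", name)
    return '"{}"~{}'.format(" ".join(tokens), slop)


if __name__ == "__main__":
    print(build_proximity_query("Joaquín Guzmán"))
    print(build_proximity_query("Sol Mi Park", 4))
```

As with fuzzy search, the string goes in the `search` parameter with `queryType=full`. Both skipped tokens and reordered tokens consume slop, so the value needs tuning against real examples of each case.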

Initials: J. E. Smith <=> James Earl Smith

You could possibly use a tokenizer in combination with fuzzy search. I am not sure about this case.
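If the index alone does not get you there, one fallback is to check initials in your own code on the candidates the search returns. The sketch below is that kind of client-side post-filter, not an Azure Cognitive Search feature; `initials_compatible` is a made-up helper, and it assumes both names have the same number of tokens in the same order.

```python
import re
from typing import List


def _tokens(name: str) -> List[str]:
    """Split a name into case-folded word tokens, dropping punctuation."""
    return [t.casefold() for t in re.findall(r"\w+", name)]


def initials_compatible(a: str, b: str) -> bool:
    """True if each token pair is equal or one is the other's initial."""
    ta, tb = _tokens(a), _tokens(b)
    if len(ta) != len(tb):
        return False
    return all(
        x == y
        or (len(x) == 1 and y.startswith(x))
        or (len(y) == 1 and x.startswith(y))
        for x, y in zip(ta, tb)
    )


if __name__ == "__main__":
    print(initials_compatible("J. E. Smith", "James Earl Smith"))
    print(initials_compatible("J. E. Smith", "John Adam Smith"))
```

This would typically run after a broader query, for example on the surname only, has narrowed a million contacts down to a short candidate list.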

A nice addition is that you can also offer suggestions and/or autocomplete to show the user possible results during typing.

My answers won't solve all cases directly, but they should give you a start. You will have to test and tweak a lot, so keep your time and budget constraints in mind.

Mick
    Thanks Mick and apologies for the slow response. We will dig in using this as a starting point and let you know how we get on. Cheers, Simon. – Simon Nov 18 '19 at 10:07