18

What is the most efficient way to implement a phonetic search in C++ and/or Java? By phonetic search I mean substituting vowels or consonants that sound similar. This would be especially useful for names because sometimes people's names have sort of strange spellings.

I am thinking it might be effective to substitue vowels and some consonants. It may also be good to include some special cases like silent E's at the end or F and PH. Would it be best to use cstrings or strings in C++? Would it be better to store a copy in memory with the substituted values or call a function every time we look for something?

ctype.h
  • 1,470
  • 4
  • 20
  • 34

2 Answers2

23

Besides Soundex you'll find also the Metaphone or Double Metaphone phonetic algorithm, which seem to be an improvement for the English pronunciation and is a quite new algorithm.

For the german pronunciation I use the "Kölner Phonetik".

Apache Commons Codec gives you a very simple Java implementation of those basic algorithms (Soundex, Metaphone, ...) http://commons.apache.org/codec/ For example see the javadoc for the soundex: http://commons.apache.org/codec/apidocs/org/apache/commons/codec/language/Soundex.html

Just by typing following code you the the phonetic value of your String:

Soundex soundex = new Soundex();
String phoneticValue = soundex.encode("YourString");

And then you can simply do it for two strings and compare the phonetic values. Hava a look at the following post if you're comparing two strings, because the equals() methods is just black and white, and maybe you'd like to know how many % it is matching:

How to compare almost similar Strings in Java? (String distance measure)

Community
  • 1
  • 1
Philipp
  • 4,645
  • 3
  • 47
  • 80
  • Do you know a JAVA implementation of the "Kölner Phonetik" – mica Oct 07 '12 at 09:56
  • 1
    Yes - we used the apache commons codec. Here you find the "ColognePhonetic" class. 'new ColognePhonetic().encode("Hans")'. But we are not anymore using it for the german language, it seemed to ignore too many things and almost all words were considered as equal. – Philipp Oct 08 '12 at 06:39
  • 2
    for german I found the Hannover-phonetics, a java implementation phonet4java, cab be found here: http://code.google.com/p/phonet4java – mica Oct 08 '12 at 18:30
15

Soundex along with its variants is the standard algorithm for this. It uses phonetic rules to transform the name into an alphanumeric code. Names with the same code are grouped together.

As far as implementing the search, I'd use a data structure that maps each soundex code to the list of names that have that code. Depending on the data structure used (a hash table or a tree), the lookup could be done in time that is either constant on logarithmic in the number of distinct soundex codes.

I am not sure what exactly you mean by cstring (Microsoft's CString?) but the standard std::string class will be perfectly fine for this problem and would be my preferred choice.

NPE
  • 486,780
  • 108
  • 951
  • 1,012