How to recognize similar words with differences in spelling

Question

I want to filter out duplicate customer names from a database. A single customer may have more than one entry in the system under the same name but with slight differences in spelling. For example, a customer named Brook may have three entries in the system with these variations:

Brook Berta
Bruck Berta
Biruk Berta

Let's assume the name is stored in a single database column. I would like to know the different mechanisms for identifying such duplicates among, say, 100,000 records. We could use regular expressions in C# to iterate through all the records, use some other pattern-matching technique, or export the records to whatever best suits such queries (for example, SQL with regular-expression capabilities).

This is what I thought of as a solution:

Write C# code to iterate through each record
Keep only the consonants, in order (in the above case: BrKBrt)
Search the other records for the same consonant pattern, treating similar-sounding letters such as (C, K), (C, S), (F, PH) as equivalent
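The steps above can be sketched roughly as follows. This is only an illustration in Python rather than C#, and the names `consonant_key` and `group_duplicates` are hypothetical. Two details are assumptions that the steps do not spell out: C is always folded into K (the C/S pairing is ignored), and doubled letters are collapsed so that "Bruck" and "Brook" produce the same key.

```python
import re
from collections import defaultdict

VOWELS = set("AEIOU")  # assumption: Y is treated as a consonant


def consonant_key(name):
    """Reduce a name to its consonants, in order, after crude sound folding."""
    text = re.sub(r"[^A-Z]", "", name.upper())        # keep ASCII letters only
    text = text.replace("PH", "F").replace("C", "K")  # assumed equivalences
    text = re.sub(r"(.)\1+", r"\1", text)             # collapse doubled letters
    return "".join(ch for ch in text if ch not in VOWELS)


def group_duplicates(names):
    """Return only the keys shared by more than one name."""
    groups = defaultdict(list)
    for name in names:
        groups[consonant_key(name)].append(name)
    return {key: found for key, found in groups.items() if len(found) > 1}


if __name__ == "__main__":
    records = ["Brook Berta", "Bruck Berta", "Biruk Berta", "Jane Doe"]
    print(group_duplicates(records))
```

Grouping by key needs a single pass over the records instead of comparing every pair. The key is deliberately lossy, though: with this sketch "Andrei" and "Andreea" both reduce to NDR, so matches are better treated as candidates for review than as confirmed duplicates.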

So please forward any ideas.

Are you sure this is safe? What about names like Andrei/Andreea (Romanian male/female names): would they be considered the same and if so, if you had brothers/spouses with the same last name you'd exclude one? — Rox, Jun 22 '10 at 08:01

score 8 · Accepted Answer · answered Jun 22 '10 at 08:00

The Double Metaphone algorithm, published in 2000, is a new and improved version of the Soundex algorithm that was patented in 1918.

The article has links to Double Metaphone implementations in many languages.

Ray Burns

score 2 · Answer 2 · edited Nov 27 '16 at 13:55

The obvious, established (and well documented) algorithms for finding string similarity are:

edited Nov 27 '16 at 13:55

mkj

answered Jun 22 '10 at 08:01

symcbean

score 2 · Answer 3 · answered Jun 22 '10 at 08:03

Have a look at Soundex.

There is a Soundex function in Transact-SQL (see http://msdn.microsoft.com/en-us/library/ms187384.aspx):

SELECT
    SOUNDEX('brook berta'),
    SOUNDEX('Bruck Berta'),
    SOUNDEX('Biruk Berta')

This returns the same value, B620, for each of the example values.
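For anyone who wants to see what goes into that code, here is a minimal sketch of the classic American Soundex rules in Python. It is not SQL Server's implementation, which may differ in edge cases, and the helper `name_key` is a hypothetical addition. Because a four-character code such as B620 only reflects the start of the string, `name_key` encodes each word separately.

```python
# Letter-to-digit table for American Soundex; vowels, H, W and Y have no digit.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(word):
    """Return the four-character Soundex code of a single word."""
    letters = [ch for ch in word.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = CODES.get(ch, "")
        if digit and digit != previous:
            code += digit
        if ch not in "HW":  # H and W do not separate letters with equal codes
            previous = digit
    return (code + "000")[:4]


def name_key(full_name):
    """Hypothetical helper: one Soundex code per word of the name."""
    return " ".join(soundex(word) for word in full_name.split())


if __name__ == "__main__":
    for name in ("brook berta", "Bruck Berta", "Biruk Berta"):
        print(name, "->", name_key(name))  # each prints B620 B630
```

Comparing the per-word keys keeps the last name in play, so two customers who share a first-name code but have different last names are not merged.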

Mario Menger

Yes, but the B620 is for the first word only. – John Saunders May 06 '11 at 20:08

score 1 · Answer 4 · answered Jun 22 '10 at 08:00

I would consider writing something like the "famous" Python spell checker.

http://norvig.com/spell-correct.html

This will take a word and find all possible alternatives based on missing letters, adding letters, swapping letters, etc.
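A condensed sketch of that candidate-generation idea, adapted to name matching, might look like this. The helper `close_matches` and the lowercase ASCII alphabet are assumptions for illustration, and the frequency model from the original spell checker is left out.

```python
import string


def edits1(word, alphabet=string.ascii_lowercase):
    """All strings one delete, transpose, replace or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:]
                for left, right in splits if right for c in alphabet]
    inserts = [left + c + right for left, right in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)


def edits2(word):
    """All strings reachable by applying edits1 twice."""
    return {second for first in edits1(word) for second in edits1(first)}


def close_matches(word, known_names):
    """Hypothetical helper: known names within two edits of word."""
    return (edits1(word) | edits2(word)) & set(known_names) - {word}


if __name__ == "__main__":
    print(close_matches("bruck", {"brook", "biruk", "berta"}))
```

The candidate set grows quickly with word length and alphabet size, so for 100,000 records it is probably cheaper to generate candidates per distinct first name than per record.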

Robin Day

score 1 · Answer 5 · answered Jun 22 '10 at 08:01

You might want to search for "phonetic similarity algorithm"; you'll find plenty of information about it, including this article on CodeProject about implementing a solution in C#.

Hans Olsson

score 1 · Answer 6 · answered Jun 22 '10 at 08:01

Look into Soundex. It's a pretty standard library in most languages and does what you require, i.e. algorithmically identifying phonetic similarity. http://en.wikipedia.org/wiki/Soundex

Leo

score 1 · Answer 7 · answered Jun 22 '10 at 10:02

There is a very nice R package for record linkage (just search for "R" in Google). The standard examples target exactly your problem: R RecordLinkage

The C code for Soundex etc. is taken directly from PostgreSQL!

FloE

score 0 · Answer 8 · answered Jun 22 '10 at 09:26

I would recommend Soundex and derived algorithms over Levenshtein distance for this problem. Levenshtein distance is more appropriate for spell-checking solutions, IMHO.
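To make the comparison concrete, here is a minimal sketch of Levenshtein distance using the standard dynamic-programming recurrence. The function name is illustrative and the sample names are taken from the question.

```python
def levenshtein(a, b):
    """Minimum number of single-character inserts, deletes and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(levenshtein("brook", "bruck"))  # 2
    print(levenshtein("brook", "biruk"))  # 3
```

On five-letter names a distance of 2 or 3 is a large fraction of the word, so any threshold loose enough to match these spellings would probably also match many unrelated names, whereas a phonetic code maps all three to the same value.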

James Westgate
