Find similar Column Names using difference SYNTAX

Question

I am working with a database with approximately 100k entries and want to find all similar names in this database that I put within one column. I am now using soundex but the results are way to fuzzy and filtering those fuzzy results in my php makes the process with so many soundex classes and entries in the database very slow so I hope there is another way to filter out better matches than soundex does.

My Query:

SELECT soundex(full_name) AS soundex, 
    full_name AS customer_name
FROM (SELECT CONCAT(cu.first_name,' ', cu.last_name) AS full_name
    FROM `customers` AS cu  
    WHERE cu.`status` = 1) a
ORDER BY soundex(full_name))

So I compare all the names that I put into one column and show them all ordered by soundex. Is there a way to user DIFFERENCE(soundex, soundex) in a perfomatively good way besides cross joining the whole table and compare each and every name with each other? Or is there a good way to sufficiently sort out not very similar names?

Are you using MySQL? Or SQL Server? your code is confuse me a bit — Ilyes, Nov 13 '18 at 14:27
@Sami I am using a MYSQL table. Where does my code confuse you? So I can check if it is understandable. — Eve Edomenko, Nov 13 '18 at 14:34
@EveEdomenko as you've used a t-sql documentation link there might be confusion — Barry, Nov 13 '18 at 14:35
@Barray thank you for your comment. I am using MySQL. I also edited the post and deleted the link to the t-sql documentation and hope my question is clearer. — Eve Edomenko, Nov 13 '18 at 14:47
You might want to include the primary key of that table. Then you can at least avoid matching the names with themselves. — LukStorms, Nov 13 '18 at 14:53
@LukStorms that sounds like a good idea to start with but how does a primary key avoid matching all the names with each other? — Eve Edomenko, Nov 13 '18 at 15:04
The id's need to be different. F.e. `... FROM customer cu1 JOIN customer cu2 ON cu2.id <> cu1.id AND ...` . Not exactly great for performance because such criteria mostly ignores an index. But you get the point. Btw, interesting old SO post [here](https://stackoverflow.com/questions/634995/implementation-of-levenshtein-distance-for-mysql-fuzzy-search) — LukStorms, Nov 13 '18 at 15:08
Thank you, I will include the addition with the id. I dont know it the levensthein distance in that case is beneficial as there are a lot of comparisons that need to be done but I will keep you posted when i find a solution that might not take forever to calculate. — Eve Edomenko, Nov 13 '18 at 15:22

score -1 · Answer 1 · answered Nov 13 '18 at 14:30

the soundex for a full_name might not be the best way to fuzzy match. Have you had a look at the levenshtein function implementation. If you use that you can get the distance between two strings and use that to sort on the best matches.

See the following example. Levenshtein distance in T-SQL

Find similar Column Names using difference SYNTAX

1 Answers1