Single Column Finding similar values

Question

I have a column sample data in EXCEL/SQL Table as below. Want to group similar records. Solution in excel or sql should be fine.

GOPAL
Ramesh
A.GOPAL
SRINIVAS
Ramesh Reddy
G.SRINVIASA RAO

The expected Output maybe sorting on similarity

 GOPAL
 A.GOPAL
 Ramesh
 Ramesh Reddy
 SRINIVAS
 G.SRINVIASA RAO

or

GOPAL  Group1
A.GOPAL Group1
Ramesh Group2
Ramesh Reddy Group2
SRINIVAS Group3
G.SRINVIASA RAO Group3

What do you mean by group similar records? Can you explain it clearly? — The_Outsider, Jul 07 '17 at 20:14
https://stackoverflow.com/questions/14700074/how-to-find-the-match-between-text-that-typo-in-excel basically sorting by soundex() — xQbert, Jul 07 '17 at 20:18

rd_nielsen · Accepted Answer · 2017-07-07T23:49:01.890

This is more of a comment than a complete answer, but may steer you in the right direction.

You will have to define a similarity metric that compares two strings and returns a value that quantifies the similarity between them. See, for example, A better similarity ranking algorithm for variable length strings. How you implement that to use in a SQL query depends on what DBMS you're using. You will also have to establish a threshold value of this metric that you want to define as 'close enough'.

Then you can write a SQL query something like this:

select
    a_val as group,
    b_val as stringval
from
    (
    select
        a_val,
        b_val,
        similarity
    from
        (
        select
            a.stringval as a_val, 
            b.stringval as b_val, 
            sim_metric(a.stringval, b.stringval) as similarity
        from
            stringlist as a,
            stringlist as b
        where
            b.stringval >= a.stringval
        ) as sim
    where
        similarity > similarity_threshold
    ) as grp
order by
    a_val;

where sim_metric() is your similarity function and similarity_threshold is your chosen threshold. One name in each group of similar names is used as the group name.

Thanks a lot. It really helps. Created a CLR function based on C#code given in the link and it gives me desired results — Srinivasa Rao, Jul 09 '17 at 17:28

Single Column Finding similar values

1 Answers1