0

I am trying to write a query that will return similar rows regarding the "Name" column. My issue is that within my SQL database , there are the following examples:

NAME            DOB       
Doe, John       1990-01-01
Doe, John A     1990-01-01

I would like a query that returns similar, but not exact, duplicates of the "Name" column. Since I do not know exactly which patients this occurs for, I cannot just query for "Doe, John%".

I have written this query using MySQL Workbench:

    SELECT
        Name, DOB, id, COUNT(*)
    FROM
        Table
    GROUP BY
        DOB
    HAVING 
        COUNT(*) > 1 ;

However, this results in an undesirable amount of results which Name is not similar at all. Is there any way I can narrow down my results to include only similar (but not exact duplicate!) Name? It seems impossible, since I do not know exactly which rows have similar Name, but I figured I'd ask some experts.

To be clear, this is not a duplicate of the other question posted, since I do not know the content of the two(or more) strings whereas that poster seemed to have known some content. Ideally, I would like to have the query limit results to rows with the first 3 or 4 characters being the same in the "Name" column. But again, I do not know the content of the strings in question. Hope this helps clarify my issue.

What I intend on doing with these results is manually auditing the rest of the information in each of the duplicate rows (over 90 other columns per row may or may not have abstract information in them that must be accurate) and then deleting the unneeded row. I would just like to get the most concise and accurate list I can to go through, so I don't have to scroll through over 10,000 rows looking for similar names.

For the record, I do know for a fact that the two rows will have exactly similar names up until the middle initial. In the past, someone used a tool that exported names from one database to my SQL database, which included middle initials. Since then, I have imported another list that does not include middle initials. I am looking for the ones that have middle initials from that subset.

Lucas
  • 3
  • 2
  • 3
    Possible duplicate of [how to compute similarity between two strings in MYSQL](https://stackoverflow.com/questions/5322917/how-to-compute-similarity-between-two-strings-in-mysql) – Iłya Bursov Aug 08 '18 at 19:05
  • 2
    you may to roll your own version of fuzzy matching: http://www.sqlservercentral.com/articles/Fuzzy+Match/65702/ , or try working with `levenshtein distance`, or yank all your data into something like Python and use fuzzy wuzzy. I don't think there is anything built in that just _does_ it – sniperd Aug 08 '18 at 19:11
  • The solutions in the other question don't require you to know the strings. You can use the same functions with strings that you read from the database. Join the table with itself and compare `t1.name` with `t2.name`. – Barmar Aug 08 '18 at 19:57
  • The method you choose depends on whether you are trying to correct for variations a person might legitimately use for their name, mistyping or deliberate attempts to fake an identity. – symcbean Aug 08 '18 at 20:14
  • @Barmar Sorry, I must have misunderstood the solutions listed there. Thank you so much. – Lucas Aug 08 '18 at 20:25

1 Answers1

0

This is a very large topic and effort depends on what you consider as "similar" and what the structure of the data is. For example are you going to want to match Doe, Johnathan as well?

Several algorithms exist but they can be extremely resource intensive when matching name alone if you have a large data set. That is why often using other attributes such as DOB, or Email, or Address to first narrow your possible matches then compare names typically works better.

When comparing you can use several algorithms such as Jaro-Winkler, Levenshtein Distance, ngrams. But you should also consider "confidence" of match by looking at the other information as suggested above.

Issue with matching addresses is you have the same fuzy logic problems. 1st vs first. So if going this route I would actually turn into GPS coordinates using another service then accepting records within X amount of distance.

And the age old issue with this is Matching a husband and wife. I personally know a married couple both named Michael Hatfield. So you could try to bring in gender of name but then Terry, Tracy, etc can be either....

Bottom line is only go the route of similarity of names if you have to and if you do look into other solutions like services by Melissa data, sql server data quality services as a tool.....

Update per comment about middle initial. If you always know the name will be the same except middle initial then this task can be fairly simple and not need any complicated algorithm. You could match based on one string + '%' being LIKE the other then testing to make sure length is only 2 different and that there is 1 more spaces in it than the smaller string. Or you could make an attempt at cleansing/removing the middle initial, this can be a little complicated if name has a space in it Doe, Ann Marie. But you could do it by testing if 2nd to last character is a space.

Matt
  • 13,833
  • 2
  • 16
  • 28
  • Thank you for the response! I understand that this was going to be a complex task. In your example, would you happen to know of the best-fit algorithm to use if I were to not include Johnathan? I know for a fact that the names will be exactly the same, other than the middle initial (edited my post too). From there, I am okay with manually auditing each of the 'potential' duplicates. – Lucas Aug 08 '18 at 20:47
  • @Lucas see update if you have a standard structure you can count on just do some string manipulation instead of using an algorithm think about structure e.g name, name and name, name mi. then think of variations e.g. name, name name mi or name, name name one always has 1 extra space in the string and ends with a single character after the space – Matt Aug 08 '18 at 22:22
  • sorry to keep this going, but what is the syntax for a generic string? How do I match "unknown, value" with "unknown, value%"? – Lucas Aug 10 '18 at 13:09
  • @Lucas It's okay I get it. honestly I started looking at it but I am not up to par on mysql for string manipulation. I would open a new question and instead of asking for an algorithm I would ask how to match between a string with and without the middle initial. If you strip off a middle initial when it has more than 2 spaces it should match you might have some false positives depending on what you write and if names can have 2 first names. Also I would make it explicit to people that it will ALWAYS be one of the 2 formats and have same first name not nickname or misspelling etc. – Matt Aug 10 '18 at 18:42