2

I have a terrible to identify unique record(value). I have a table like this:

ID NAME            DESCRIPTION
1  Yanagida Fumit  best author
2  Ha Il-kwan      new author
3  Fumit Yanagida  best author
4  Ha Il Kwan      new author
5  Ilkwan Ha       new author

There are 5 records in same table called autho table. But in actually, there are 2 authors. First record and third record are stored from one author information and second, 4th and 5th are one author. I want to make this like below.

ID NAME            DESCRIPTION
1  Yanagida Fumit  best author
2  Ha Il Kwan      new author

It means that, I am going to erase all duplicates against reverse name problem. I wonder if I can compare two values(string) in same column. Help me please. I will be happy with your any help!

Leon
  • 141
  • 2
  • 9
  • 1
    You want to use levenshtein distance to resolve this. – Mech Feb 17 '20 at 20:07
  • Convert to SET, where each bit is a presence of definite letter. This is enough to identify >95% of duplicates. – Akina Feb 17 '20 at 20:08
  • @Mech Thanks for your quick answer! Would you mind explain further more detail about levenshtein distance? – Leon Feb 17 '20 at 20:09
  • See this: https://stackoverflow.com/questions/24107101/mysql-using-levenshtein-distance-to-find-duplicates-in-20-000-rows – blackbishop Feb 17 '20 at 20:11
  • Levenshtein is not applicable - the distance between 'Ha Il Kwan' and 'Ilkwan Ha' won't give to identify that they're duplicates. – Akina Feb 17 '20 at 20:13
  • 2
    I have found a similar solution here on stack but I think I am not using it correct :) https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=f10c70f08222a170eb5993c515975a7d – VBoka Feb 17 '20 at 20:14
  • 1
    Thanks everyone! I am very happy for your help. I face levenshtein first time. So I surprise with every answer (especially @VBoka 's answer). :) – Leon Feb 17 '20 at 20:19
  • Are you looking to do this solely through mysql or do you want to run a php script? – Mech Feb 17 '20 at 20:20
  • @Mech I want to solve this problem using only sql query. I might use php script but I think it 's better I use only sql. – Leon Feb 17 '20 at 20:23
  • @Mech so do you have any solution with php script? – Leon Feb 17 '20 at 20:31
  • Having a look at what I can do for you in mysql. Are you looking to specify the terms of the deletion? ie search for "Yanagida Fumit" and delete the rest? – Mech Feb 17 '20 at 20:47
  • @Mech Yes you are right. it has a bit more conditions but I will resolve them myself. Main point is to remove rest after searching. – Leon Feb 17 '20 at 21:01
  • building a solution now. – Mech Feb 17 '20 at 21:39
  • .... so what happens if you have two actual authors with "mirrored" names; `Frank Ben` and `Ben Frank` – Clockwork-Muse Feb 18 '20 at 00:48
  • That's exactly what the provided example is. MATCH AGAINST will find the data, regardless of order. – Mech Feb 18 '20 at 02:02
  • Your table has three different version of the name for "Ha Il Kwan" -- what logic could the code employ to find the one you like best? At a loss to understand how you chose. – Charlie K Feb 24 '20 at 18:20
  • @CharlieK I will prefer "Ha Il Kwan". Anyway thanks for your considering my question. – Leon Feb 25 '20 at 00:48

2 Answers2

0

Here is a php solution:

remove_duplicates("Yanagida Fumit");

function remove_duplicates($full_search_str) {        
    // establish connection to your db

    // fetch data
    $query = "          SELECT  `id`, `name` 
                        FROM    `" . $tbl_name . "`
                        WHERE   MATCH (`name`) AGAINST ('" . $full_search_str . "' IN BOOLEAN MODE)
                        AND     `name` <> '" . $full_search_str . "'
             ";

    // run query
    $results = $conn->query($query);

    // loop through results
    foreach($results as $result) {
            // build query
            $query = "  DELETE 
                        FROM    `" . $tbl_name . "`
                        WHERE   `id` = " . $result['id'];

            // run query
            $result = $conn->query($query);
    }
}
Mech
  • 3,952
  • 2
  • 14
  • 25
  • 1
    ... the PHP is not relevant. Furthermore, it's potentially dangerous due to the unnecessary dynamic SQL (and lack of a prepared statement). Oh, and it would remove **all** instances of a given name, which isn't at all desired. – Clockwork-Muse Feb 18 '20 at 00:35
0

Oracle answer:

create or replace table authors as Select distinct name, description from authors;

Charlie K
  • 114
  • 5
  • ....not sufficient; the data as presented is _already_ distinct (although this may remove some entries). – Clockwork-Muse Feb 18 '20 at 00:45
  • On first read, I didn't see that you have records with data misplaced between fields in a single record as well as inconsistent data for a single field, consequently you're left with questions as to the correct record between: first name last name Ha Il Kwan Il Kwan Ha Ha Il-Kwan This isn't a SQL uniqueness problem -- it's a data corruption problem that is best solved by controlling what gets into the data at the start. – Charlie K Feb 24 '20 at 18:09