1

I want to run through a data-set and see if there are any duplicates in it. I am wondering which way would be more efficient for the server. Running through the data-set and sending new queries to the database, or running through the data-set with a second nested for loop.

I think the code example will make it more understandable.

SELECT * FROM contacts;

Let's say this query yields an array:

[0] {id: 1, firstname: 'John', lastname: 'Smith'}
[1] {id: 2, firstname: 'Michael', lastname:'Jones'}
...
[99] {id: 100, firstname: 'Jerry', lastname:'Brown'}

And is saved into a php array:

$data

Now let's say that I want to run through each contact in $data and then see if there are any contacts with the same first name and last name. Which method would be more efficient?

1:

for($i = 0 ; $i < sizeof($data) ; $i++){
    #query db:
    $newQuery = SELECT * FROM Contacts WHERE firstname=$data[$i]['firstname'] AND lastname=$data[$i]['lastname'];
    if(sizeof($newQuery > 1)){
         #log contacts.
    }
}

2:

for($i = 0 ; $i < sizeof($data) ; $i++){
    for($j = $i+1; $j < sizeof($data); $i++){
        if($data[$i]['firstname'] === $data[$j]['firstname'] && $data[$i]['lastname'] === $data[$j]['lastname']){
            #log contacts;
        }
    }
}

Of course if i'm going about this totally wrong and there's a better way to do it all together I would be happy to learn!

CodeAt30
  • 874
  • 6
  • 17
  • Neither. `Select firstName, lastName, count(*) from contacts group by firstName, LastName having having count(*) > 1` Get the data you need when you need it; If you need more data such as the ID or other data; run a second query based on the first/lastname to get only those records. a database set based processing is going to be faster than nearly anything one can write. – xQbert Apr 05 '18 at 13:31
  • Hello don't know exactly what you want but you should definitely build one request and put all your conditions in it... – Ermac Apr 05 '18 at 13:32
  • @xQbert Ok I understand . I didn't think of it. This is much better. However I would like to add something: What if I didn't want to necessarily check for total Duplicates, but similar names. – CodeAt30 Apr 05 '18 at 13:32
  • use soundex() for sounds like pattern matches https://dev.mysql.com/doc/refman/5.7/en/string-functions.html#function_soundex there are other pattern matches as well soundex() is just one example. Jaro-Winkler fuzzy logic matching is another. See: https://stackoverflow.com/questions/48406993/jaro-winkler-function-why-is-the-same-score-matching-very-similar-and-very-diff – xQbert Apr 05 '18 at 13:34
  • @xQbert wow thanks! Is there something like `levenshtein()` in mySQL? – CodeAt30 Apr 05 '18 at 13:41
  • https://stackoverflow.com/questions/13909885/how-to-add-levenshtein-function-in-mysql or https://gist.github.com/Kovah/df90d336478a47d869b9683766cff718 (untested functions in those links: I assume they work however) – xQbert Apr 05 '18 at 13:42
  • Thanks dude/ette I was totally unaware that SQL extension functionality, this will ease it all up quite a bit! – CodeAt30 Apr 05 '18 at 13:44
  • 1
    You just happened to catch me on a good day :P You may want to adjust your question to include "Similar" names in your matching. but those get tricky. Once you go down this path the opportunity for false positives and possible incorrect results/matches increases. 1/2 the battle is knowing what to look for. Hopefully this puts you on a more productive path. – xQbert Apr 05 '18 at 13:47
  • @xQbert Yeah I was just quite off the proper way to go about this. I wanted to understand something more fundamental first which was accomplished with this question and answers. Now I can go about the similarity issue with a proper basis to build from. I was solving it with unnecessary php scripting earlier. – CodeAt30 Apr 05 '18 at 14:06

1 Answers1

1

The most efficient way is to let the database do the work. This is a general rule. Database optimizers have more options than looping through arrays. Databases servers are often more powerful than the nodes where applications run. And, passing back all the data can be (relatively) expensive compared to returning only the results you want. (That said, there are exceptions but they are rare.)

If you want the name pairs, then you can do:

select firstname, lastname, count(*) as cnt
from contacts
group by firstname, lastname
having cnt > 1;

If you want the original rows that are the duplicates, then I would recommend exists:

select c.*
from contacts c
where exists (select 1
              from contacts c2
              where c2.firstname = c.firstname and c2.lastname = c.lastname and
                    c2.id <> c.id
             );

For this query, you want an index on contacts(lastname, firstname).

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786