0

lets see my code:

function checkForDuplicates() {            
           $data = $this->input->post();
           $project_id = $data['project_id'];

           $this->db->where('project_id', $project_id);
           $paper = $this->db->get('paper')->result();

           $paper2 = $paper; //duplica o array de papers
           $duplicatesCount = 0;

           foreach($paper as $p){
               $similarity = null;

                foreach($paper2 as $p2){
                    if($p -> status_selection_id !== 4 && $p2 -> status_selection_id !== 4){ 
                        if($p -> paper_id !== $p2 -> paper_id){ 
                            similar_text($p -> title, $p2 -> title, $similarity);

                            if ($similarity > 90) { 
                                $p -> status_selection_id = 4;
                                $this->db->where('paper_id', $p -> paper_id);
                                $this->db->update('paper', $p);
                                $duplicatesCount ++;
                            }
                        }
                    }
                }
            }

            $data = array(
                'duplicatesCount' => $duplicatesCount,
                'message' => 'Duplicates where found!'
            );
            echo json_encode($data);
        }
  1. similar_text takes 180 seconds to check 1500 records.
  2. levenshtein takes 101 seconds to check 1500 records.
  3. if($pp1 === $pp2) takes 45 seconds to check 1500 records.

what would be the quickest way to check duplicate records and change their status?

  • the same operation you can do by query. – er.irfankhan11 Jul 26 '18 at 05:22
  • do you want to check just duplicate of id? – AbdulAhmad Matin Jul 26 '18 at 05:26
  • You can't just give us a code and directly start with best methods to do it. Provide us more context. What is the code doing? – nice_dev Jul 26 '18 at 05:33
  • https://stackoverflow.com/questions/3338889/how-to-find-similar-results-and-sort-by-similarity – Atural Jul 26 '18 at 05:37
  • Can you explain : are DB IO relevant or only the algo ? In all cases you should extract all needed data from DB, process Data, and bulk update at the end. – quazardous Jul 26 '18 at 05:57
  • I'm looking for duplicate titles and sorting them with the duplicate status. $paper is an array of objects. $paper -> title It's my real interest. @quazardous, I want a way to make the process faster, understand? sry about my english! – Junior Reis Jul 26 '18 at 06:12
  • @AbdulAhmadMatin I'm interested in the title, but I need to know the code to change the status in the query. – Junior Reis Jul 26 '18 at 06:20

1 Answers1

1

Optimize is often reduce IO.

In your case reducing the number of SQL queries should improve the processing time.

If you need to process a large amount of records, you should split it in chunks. Each chunk should contain a batch of records that can fit into memory (RAM).

Retrieve your chunk from DB. Process your chunk (ie with a loop) and keep track of the changes you need to do in DB using array (ie). At the end bulk update the DB with as few queries you can.

       $data = $this->input->post();
       $project_id = $data['project_id'];

       $this->db->where('project_id', $project_id);
       $paper = $this->db->get('paper')->result();

       $paper2 = $paper; //duplica o array de papers
       $duplicatesCount = 0;

       // keep track of updates
       $updates = [];

       foreach($paper as $p){
           $similarity = null;

            foreach($paper2 as $p2){
                if($p -> status_selection_id !== 4 && $p2 -> status_selection_id !== 4){ 
                    if($p -> paper_id !== $p2 -> paper_id){ 
                        similar_text($p -> title, $p2 -> title, $similarity);

                        if ($similarity > 90) { 

                            $updates[] = [
                                'paper_id' => $p -> paper_id,
                                'status_selection_id' => 4,
                            ];

                            $duplicatesCount ++;
                        }
                    }
                }
            }
        }

        if ($duplicatesCount > 0) {
             // here you have to create a big SQL request with all the updates
             // maybe your DB adaptor can do it for you ?
             $query = $this->db->somethingToCreateABulkQuery();
             foreach ($updates as $update) {
                 // stuff 
                 $query->somethingToAddAndUpdate($update);
             }
             $this->db->somethingToExecuteTheQuery($query);

        }
quazardous
  • 846
  • 10
  • 15
  • I tried as you said, the time improved a few seconds, but still bad =(, thx bro <3 – Junior Reis Jul 27 '18 at 00:46
  • Well OK was not the IO this time :/ You have to come with another way to compute the distance or bear with the process time. First your process seams incremental (you test status_selection_id). So it's slow but predictable. So if it's in a cron... Second maybe you can use more preprocess. ie create a word vector and store it. Instead of using text_similarity use a vector distance function first (and text_similarity if distance is little enough). – quazardous Jul 28 '18 at 06:18