Delete duplicates and update the one unique left all in one statement?

Question

CREATE TABLE hostname_table
(
id INT NOT NULL AUTO_INCREMENT,
hostname CHAR(65) NOT NULL,
interval_avg INT,
last_update DATETIME NOT NULL,
numb_updates INT,
PRIMARY KEY (id)
)

I have this table, and I import 500-600k rows of data into it. I do not check for duplicates when writing to the database, because I want to know how many duplicates of each host there is, and I also want to know the intervals between each update of said hostname.

Example values in hostname_table:

id  hostname          interval_avg  last_update          numb_updates
1   www.host.com      60            2012-04-25 20:22:21  1
2   www.hostname.com  10            2012-04-25 20:22:21  5
3   www.name.com      NULL          2012-04-25 20:22:21  NULL
4   www.host.com      NULL          2012-04-25 20:22:26  NULL
5   www.host.com      NULL          2012-04-25 20:22:36  NULL

Example of what I want it to look like when I have cleaned it up:

id  hostname          interval_avg  last_update          numb_updates
1   www.host.com      25            2012-04-25 20:22:36  3
2   www.hostname.com  10            2012-04-25 20:22:21  5
3   www.name.com      NULL          2012-04-25 20:22:21  NULL

With a huge database like this, I dont want to send too many queries to obtain this goal, but I believe 3 queries are the minimum for an operation like this(if I am wrong, please correct me). Each hour there will be ~500k new rows where ~50% or more will be duplicates, therefore its vital to get rid of those duplicates as efficiently as possible while still keeping a record of how many and how often the duplicates occured(hense the interval_avg and numb_update update).

This is a three step problem, and I was hoping the community here would give a helping hand.

So to summarize in pseudocode, I need help optimizing these queries;

select all last_update and interval_avg values, get sum(numb_update), get count(duplicates) foreach hostname,
update interval_avg in min(id), update numb_updates in min(id), update last_update in min(id) with the value from max(id),
delete all duplicates except min(id)

SOLVED. I have optimized one part by 94%, and another part by ~97% over the course of a couple of days researching. I truely hope this will help other searching for the same solutions. mySQL and large databases can be a big problem if you choose the wrong solution. (I changed the last_update column from DATETIME to INT(10), and I changed from a formated time to a timestamp as value in my final solution to be able to get the max(last_update) and min(last_update) values)

(Thanks to GolezTrol for helping with parts of the problem)

score 4 · Accepted Answer · answered Apr 26 '12 at 07:10

4

You cannot get each different value of interval_avg and numb_updates for a hostname if you want to aggregate by that hostname. Did you mean to SUM or maybe AVG them? Or do you just want to keep the value of the lowest id?

In the query below I sum them.

SELECT 
  MIN(id) as id, 
  hostname, 
  SUM(interval_avg) as total_interval_avg,
  SUM(numb_updates) as total_numb_updates,
  COUNT(*) as hostname_count
FROM
  hostname_table
GROUP BY 
  hostname

After this, you will need to update each found id with the right values for interval_avg and numb_updates.

After that, you will need to delete each id that is not found by this query.

DELETE FROM hostname_table
WHERE
  id NOT IN
    (SELECT 
      MIN(id)
    FROM
      hostname_table
    GROUP BY 
      hostname)

answered Apr 26 '12 at 07:10

GolezTrol

114,394
18
182
210

@GolezTrol: The `DELETE` statement may not be very efficent. I think MySQL does not perform very well when you have subqueries in the `WHERE` of a Delete statement. It will probably be more efficient to move the derived subquery into a Join. – ypercubeᵀᴹ Apr 26 '12 at 08:32
@ypercube And how do you join in a delete? – GolezTrol Apr 26 '12 at 08:50
1

@turbopipp It's not entirely clear what you need for interval_avg, but it looks like you need to recalculate the avarage, so you may either need `AVG(interval_avg)` or even `AVG(interval_avg * numb_updates)`. – GolezTrol Apr 26 '12 at 08:53
@GolezTrol I added example values in my question now, hopefully that will clarify what I try to achieve. :) Thanks for all your help thus far! – turbopipp Apr 26 '12 at 10:39

score 0 · Answer 2 · edited May 23 '17 at 12:14

I went with this solution

--------------------------------

1.select all min+max last_update, sum(interval_avg), sum(numb_update) and count(duplicates) foreach hostname

//This will get the interval_avg value
//(summarize is ok, since all except min(id) will be zero), 
//give a count of how many duplicates there are per hostname, 
//and will also summarize numb_updates
SELECT 
  MIN(id) as id, 
  hostname, 
  SUM(numb_updates) as total_numb_updates,
  SUM(interval_avg) as total_interval_avg,
  MAX(last_update) as last_update_max,
  MIN(last_update) as last_update_min,
  COUNT(*) as hostname_count
FROM
  hostname_table
GROUP BY 
  hostname
HAVING 
  COUNT(*)>1

//Get all last_update from each duplicate hostname(including the original)
//Dont do this in a seperate query, you only need first+last+rowcount to figure
//out the interval average. It took me a while to realize this, so I tried many
//varieties with little success(took too long with +600k rows) 
//
// --- I will include the solution I didn't go for, ---
// --- so others wont do the same mistake ---
//
// START DONT USE THIS
// 2.63sec @ 10000 rows
$sql = "SELECT
  id, 
  ".$db_table.".hostname, 
  last_update 
FROM 
  ".$db_table." 
INNER JOIN (
  SELECT 
    hostname, 
    COUNT(*) 
  FROM 
    ".$db_table." 
  GROUP BY 
    hostname 
  HAVING 
    COUNT(*)>1
) as t2
ON 
  ".$db_table.".hostname = t2.hostname";

$resource = mysql_query($sql,$con);
// END DONT USE THIS (below is a 94% improvement)
//
// START THIS IS BETTER, BUT DONT USE THIS
// 0.16 sec @ 10000 rows
//Select everything from the table
$sql = "SELECT id 
    FROM ".$db_table;
$resource = mysql_query($sql,$con);
$array_id_all = array();
while($assoc = mysql_fetch_assoc($resource)){
    array_push($array_id_all, $assoc['id']);
}

//This will select the ID of all the hosts without duplicates
$sql = "SELECT 
  MIN(id) as id, 
  hostname
FROM
  ".$db_table."
GROUP BY 
  hostname
HAVING 
  COUNT(*)=1";

$resource = mysql_query($sql,$con);

$array_id_unique = array();
while($assoc = mysql_fetch_assoc($resource)){
    array_push($array_id_unique, $assoc['id']);
}

$array_id_non_unique = array_diff($array_id_all, $array_id_unique);
$id_list_non_unique = implode(", ", $array_id_non_unique);

//Select everything from the table when the IDs are IN $id_list_non_unique
$sql = "SELECT * 
    FROM ".$db_table." 
    WHERE id IN (".$id_list_non_unique.")";
$resource = mysql_query($sql,$con);

$array_duplicates = array();
$i=0;
while($assoc = mysql_fetch_assoc($resource)){
    $array_duplicates[$i] = array($assoc['id'], $assoc['hostname'], $assoc['interval_avg'], $assoc['last_update'], $assoc['numb_updates']);
    $i++;
}
// END THIS IS BETTER, BUT DONT USE THIS

(thanks to Nick Fortescue @ https://stackoverflow.com/a/877051/1248273)

2.update interval_avg in min(id), update numb_updates in min(id), update last_update in min(id) with the value from max(id)

//update the interval_avg, last_update and numb_update value of the min(id)
//of each duplicate hostname.
// --- I will include the solution I didn't go for, ---
// --- so others wont do the same mistake ---
//
// START DONT USE THIS
// 167 secs @ 500k rows
UPDATE hostname_table
  SET interval_avg = CASE id
    WHEN 1 THEN 25
    //etc
  END,
  last_update = CASE id
    WHEN 1 THEN "2012-04-25 20:22:36"
    //etc
  END,
  numb_update = CASE id
    WHEN 1 THEN 3
    //etc
  END
WHERE id IN (1)
// END DONT USE THIS
//
// START USE THIS
// 5.75 secs @ 500k rows (96.6% improvement)
INSERT INTO hostname_table (id,interval_avg,last_update,numb_updates)
  VALUES 
    ('1','25','2012-04-25 20:22:36','3'), 
    //etc
ON DUPLICATE KEY UPDATE 
  interval_avg=VALUES(interval_avg), 
  last_update=VALUES(last_update), 
  numb_updates=VALUES(numb_updates)
// END USE THIS

(thanks to Michiel de Mare @ https://stackoverflow.com/a/3466/1248273)

3.delete all duplicates except min(id)

//delete all duplicates except min(id)
ALTER IGNORE TABLE hostname_table ADD UNIQUE (hostname)
ALTER TABLE hostname_table DROP INDEX hostname

(thanks to GolezTrol for a good push in the right direction on selecting the first info I needed)

Delete duplicates and update the one unique left all in one statement?

So to summarize in pseudocode, I need help optimizing these queries;

2 Answers2

I went with this solution

--------------------------------