CREATE TABLE hostname_table
(
id INT NOT NULL AUTO_INCREMENT,
hostname CHAR(65) NOT NULL,
interval_avg INT,
last_update DATETIME NOT NULL,
numb_updates INT,
PRIMARY KEY (id)
)
I have this table, and I import 500-600k rows of data into it. I do not check for duplicates when writing to the database, because I want to know how many duplicates of each host there is, and I also want to know the intervals between each update of said hostname.
Example values in hostname_table:
id hostname interval_avg last_update numb_updates
1 www.host.com 60 2012-04-25 20:22:21 1
2 www.hostname.com 10 2012-04-25 20:22:21 5
3 www.name.com NULL 2012-04-25 20:22:21 NULL
4 www.host.com NULL 2012-04-25 20:22:26 NULL
5 www.host.com NULL 2012-04-25 20:22:36 NULL
Example of what I want it to look like when I have cleaned it up:
id hostname interval_avg last_update numb_updates
1 www.host.com 25 2012-04-25 20:22:36 3
2 www.hostname.com 10 2012-04-25 20:22:21 5
3 www.name.com NULL 2012-04-25 20:22:21 NULL
With a huge database like this, I dont want to send too many queries to obtain this goal, but I believe 3 queries are the minimum for an operation like this(if I am wrong, please correct me). Each hour there will be ~500k new rows where ~50% or more will be duplicates, therefore its vital to get rid of those duplicates as efficiently as possible while still keeping a record of how many and how often the duplicates occured(hense the interval_avg and numb_update update).
This is a three step problem, and I was hoping the community here would give a helping hand.
So to summarize in pseudocode, I need help optimizing these queries;
- select all last_update and interval_avg values, get sum(numb_update), get count(duplicates) foreach hostname,
- update interval_avg in min(id), update numb_updates in min(id), update last_update in min(id) with the value from max(id),
- delete all duplicates except min(id)
SOLVED. I have optimized one part by 94%, and another part by ~97% over the course of a couple of days researching. I truely hope this will help other searching for the same solutions. mySQL and large databases can be a big problem if you choose the wrong solution. (I changed the last_update column from DATETIME to INT(10), and I changed from a formated time to a timestamp as value in my final solution to be able to get the max(last_update) and min(last_update) values)
(Thanks to GolezTrol for helping with parts of the problem)