4

I have a table called emaildata consisting of four columns: emailaddress, domainname, data and id.

The emailaddress column should contain only unique entries, but there are many duplicates. The domainname and data columns are not unique, and as such will contain duplicates, which is fine. The id column is set to auto-increment, so it will contain only unique values.

My question is how do I get rid of all rows that feature duplicate email addresses, keeping the one with the lowest id?

There should be around 370,000 rows, but currently I've got 906,000.

I had an SQL statement that I used on a similar table before, and I've tried to adapt it to this one without success.

DELETE T1
FROM emaildata T1, emaildata T2
WHERE T1.emailaddress = T2.emailaddress
  AND T1.id > T2.id;

The above was based on the following, which was applied to another table and worked fine.

DELETE T1
FROM email_list_subscribers T1, email_list_subscribers T2
WHERE T1.emailaddress = T2.emailaddress
  AND T1.subscriberid > T2.subscriberid;

I've tried running this against my table on the remote server in phpMyAdmin; after pressing the Go button, the loading bar comes up in the middle, then disappears as if it is processing - but it never finishes.

I've tried repeating this against the same table running on my home server (XAMPP) via phpMyAdmin, and again with HeidiSQL - the same problem occurs with phpMyAdmin, and Heidi appears to crash.

I've tried other solutions that I've seen on here, but I seem to be getting the same "timeout"/crash problem. I never had issues with the original statement running on the remote server, though granted that was against a database a third of the size.

Any info would be appreciated.

2 Answers

4

Your query appears to be correct. Your issue seems to be a performance issue, not a logic issue. You'll need to make sure that both your emailaddress and id columns are properly indexed - otherwise, with close to a million rows, I would expect your query to hang.

(I would guess that id is probably already indexed, but not emailaddress. Especially with the self-join here, if either of these columns is not indexed, you're going to be looking at a LOT of full table scans.)
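For reference, a quick way to check what is already indexed is to ask the server directly (SHOW INDEX is a standard MySQL statement, shown here against the table name from the question):

SHOW INDEX FROM emaildata;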

Edit:

Seeing your comment that this is the case, you can follow the documentation at http://dev.mysql.com/doc/refman/5.0/en/create-index.html for creating indices. So something like:

CREATE INDEX email_index ON emaildata(emailaddress) USING BTREE;
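As a related aside, on the MySQL versions current at the time (5.6 and earlier) there was a one-step alternative: adding the unique index with IGNORE, which silently drops whichever rows would violate it and also prevents the duplicates from coming back. Treat this as a sketch rather than a drop-in fix - which duplicate survives depends on the engine's read order (for InnoDB with id as the primary key it is the lowest id), the IGNORE clause was removed in MySQL 5.7, and the index name below is just illustrative:

-- deduplicates and adds a unique constraint in one statement (MySQL <= 5.6)
ALTER IGNORE TABLE emaildata
  ADD UNIQUE INDEX uniq_email (emailaddress);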
ziesemer
  • The id (here called myid) has the following under Indexes under table structure. There are no other indexes on anything, including emailaddress. How do I go about adding these? myid BTREE Yes No myid 899943 A – Richard Downes Dec 30 '11 at 14:55
  • I'm afraid that site is a bit too complex for me. I tried the statement you suggested but got the following: #1146 - Table 'emaildata.lookup' doesn't exist – Richard Downes Dec 30 '11 at 15:18
  • @RichardDownes - see my updated answer. (I had a typo in the `CREATE INDEX` statement in my last comment.) – ziesemer Dec 30 '11 at 15:40
  • That worked: Your SQL query has been executed successfully (Query took 16.3626 sec). Trying my statement now... – Richard Downes Dec 30 '11 at 15:47
  • Annnnnnd done. Took a while, but my result is 376,256 so it looks like it worked. Thanks for your help :) Have a good New Year. – Richard Downes Dec 30 '11 at 16:26
0

I've never tried to see whether array_unique (a PHP function) modifies the keys, but here is how you can do it...

Select id and emailaddress and store them in an array as id => email.

Then use array_unique to get a new array with the first id for each repeated group (that's how the function works), then compare the two arrays and delete the remaining ids from your table.

This way you get the first id from each repeated group, and unique values.
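For comparison, the same keep-the-first-id logic can be expressed entirely in SQL against the schema from the question, which avoids pulling rows into PHP at all. A minimal sketch (the derived table named keepers is only there to work around MySQL's rule against selecting from the table being deleted from):

DELETE FROM emaildata
WHERE id NOT IN (
    -- lowest id per email address, i.e. the rows to keep
    SELECT keep_id
    FROM (
        SELECT MIN(id) AS keep_id
        FROM emaildata
        GROUP BY emailaddress
    ) AS keepers
);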

Tudor
  • Please leave the database processing concern within the database. Don't move the problem across tiers. Trying to bring down close to a million rows in this fashion from the database to the application tier will, at a minimum, require undue time and RAM - if not throwing the application server into an out-of-memory situation first. – ziesemer Dec 30 '11 at 14:59
  • Heh, you're right... but you can't always leave the raw data from a DB unchanged (so sometimes you have to hand millions of rows out to the app server for further handling)... imagine Google, which outputs millions of rows... if I'm wrong, please tell me... in Richard's case, yeah, it's not the best approach, but it was just an idea... – Tudor Dec 30 '11 at 15:13
  • Also, some people consider DB routines outdated... and they do most of the logic in the app server... and I state again that my approach is not good for Richard's case... – Tudor Dec 30 '11 at 15:30