1

I have the following table:

tbl
source    type    date
---       ---     ---
google    A       2010-02-25
google    A       2013-04-11
facebook  C       2008-10-22
facebook  C       2007-01-28

I want to keep only a single entry of each source, and the criteria is select the source tuple with min(date) group by source. The table consists of millions of records, and I'm looking for an efficient way to delete redundant records.

SaadH
  • 1,158
  • 2
  • 23
  • 38

4 Answers4

3

In MySQL, you can do this using a join:

delete t
    from t join
         (select source, min(date) as mindate
          from t
          group by source
         ) tt
         on t.source = tt.source
    where t.date > tt.mindate;

The only way -- off-hand -- that I can think to make this more efficient is to store the aggregation result in a subquery and add an index to it.

I can also add that regardless of the computation for determining the rows to delete, deleting lots of rows in a table in inefficient. Usually, I would recommend a three-step approach:

  1. Write a query to generate the table you want and store the results in a temporary table.
  2. Truncate the original table.
  3. Re-insert the (much) smaller number of rows.
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
0

In Microsoft SQL, you can try this.

;
WITH cte 
        AS (SELECT ROW_NUMBER() OVER (PARTITION BY source, type
                                        ORDER BY createdate) RN
            FROM   tbsource)
DELETE FROM cte
WHERE  RN > 1;
Madhukar
  • 1,194
  • 1
  • 13
  • 29
  • @GordonLinoff, I'm sorry. The question doesn't contain it is for MySQL. – Madhukar Oct 09 '17 at 13:51
  • There is a mysql tag, but probably should be in the title too if that's what it should be. – Michael K Oct 09 '17 at 14:07
  • @mikato, Yeah correct! Even SQL tag is there which makes this question to see as a generic one. – Madhukar Oct 09 '17 at 14:09
  • And what is Microsoft SQL? Likely you meant Microsoft SQL Server or Microsoft TSQL. This is not splitting hairs since MS Access has a SQL dialect of its own distinct from MSSQL and both are MS products. And [Microsoft Query](https://support.office.com/en-us/article/Use-Microsoft-Query-to-retrieve-external-data-42a2ea18-44d9-40b3-9c38-4c62f252da2e) spans several interfaces. – Parfait Oct 09 '17 at 14:10
  • @Parfait, You're correct. I meant Microsoft SQL Server or Transact SQL here. – Madhukar Oct 09 '17 at 14:11
  • @Madhukar Thanks for your reply. I was looking for a MySQL query. Should have mentioned it in the title too. My mistake. – SaadH Oct 09 '17 at 14:13
  • @SaadH, No problem :) – Madhukar Oct 09 '17 at 14:14
0
delete from t where date not in (select al.d from (select min(date) as d  from t group by source )al);
Nayan Sharma
  • 1,823
  • 18
  • 19
  • Thank you for this code snippet, which might provide some limited, immediate help. A proper explanation [would greatly improve](//meta.stackexchange.com/q/114762) its long-term value by showing *why* this is a good solution to the problem, and would make it more useful to future readers with other, similar questions. Please [edit] your answer to add some explanation, including the assumptions you've made. – Toby Speight Oct 09 '17 at 16:54
0

Add an identity column to the duplicate table as a serial number that acts as a row unique identifier(auto incremental ascending order):

 alter table tbl add sno int identity(1,1)

table

This query selects only non duplicated rows with min(date):

(select min(date),sno From tbl group by source) 

So "sno" will be equals to "1" and "4".

Now join with this table, and delete the records of the join that are duplicated (t.sno is null)

delete E from tbl E
    left join
    (select min(date),sno From tbl group by source) T on E.sno=T.sno
where T.sno is null

table3

Solution adapted from method 3 of this link: LINK

Gaetano Piazzolla
  • 1,388
  • 1
  • 16
  • 31