24

I have two DB tables in a one-to-many relationship. The data looks like this:

select * from student join application using (StudentID)

Resultset:

+-----------+---------------+---------------------+
| StudentID | ApplicationID | ApplicationDateTime |
+-----------+---------------+---------------------+
| 1         | 20001         | 12 April 2011       |
| 1         | 20002         | 15 May 2011         |
| 2         | 20003         | 02 Feb 2011         |
| 2         | 20004         | 13 March 2011       |
| 2         | 20005         | 05 June 2011        |
+-----------+---------------+---------------------+

I want to delete all applications except for the most recent one. In other words, each student must only have one application linked to it. Using the above example, the data should look like this:

+-----------+---------------+---------------------+
| StudentID | ApplicationID | ApplicationDateTime |
+-----------+---------------+---------------------+
| 1         | 20002         | 15 May 2011         |
| 2         | 20005         | 05 June 2011        |
+-----------+---------------+---------------------+

How would I go about constructing my DELETE statement to filter out the correct records?

frederj
sim

3 Answers

27
DELETE FROM application
WHERE ApplicationDateTime <> (SELECT max(ApplicationDateTime)
                              FROM application a2
                              WHERE a2.StudentID = application.StudentID)

Given the long discussion in the comments, please note the following:

The above statement will work on any database that properly implements statement-level read consistency, regardless of any changes to the table while the statement is running.

Databases where I know for certain that this works correctly even with concurrent modifications to the table: Oracle (which this question is about), Postgres, SAP HANA, Firebird, and most probably MySQL using InnoDB. They all guarantee a consistent view of the data as of the point in time the statement started, so changing the <> to < will not change anything for them.

For those databases, the statement is not subject to the isolation level, because phantom reads or non-repeatable reads can only happen between multiple statements, not within a single statement.

For databases that do not implement MVCC properly and instead rely on locking to manage concurrency (thus blocking concurrent write access), this might actually yield wrong results if the table is updated concurrently. For those, the workaround using < is probably needed.
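If you want to try the statement outside Oracle, here is a minimal runnable sketch using SQLite through Python's sqlite3 module (an assumption purely for illustration; the table and column names are taken from the question, and dates are stored in ISO format so that string comparison orders them chronologically):

```python
# Hypothetical demo of the correlated-subquery DELETE, run on SQLite
# instead of Oracle for illustration only.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE application (
                    StudentID INTEGER,
                    ApplicationID INTEGER,
                    ApplicationDateTime TEXT)""")
# Sample data from the question, dates in ISO format
conn.executemany("INSERT INTO application VALUES (?, ?, ?)",
                 [(1, 20001, "2011-04-12"), (1, 20002, "2011-05-15"),
                  (2, 20003, "2011-02-02"), (2, 20004, "2011-03-13"),
                  (2, 20005, "2011-06-05")])

# Delete every application except the most recent one per student
conn.execute("""DELETE FROM application
                WHERE ApplicationDateTime <> (SELECT max(ApplicationDateTime)
                                              FROM application a2
                                              WHERE a2.StudentID = application.StudentID)""")

print(conn.execute("SELECT StudentID, ApplicationID FROM application "
                   "ORDER BY StudentID").fetchall())
# [(1, 20002), (2, 20005)]
```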

  • I'd suggest to use '<' instead of '<>' to avoid potential problems with concurrent transactions. – No-Bugs Hare Nov 17 '16 at 09:45
  • @No-BugsHare: the statement won't see any changes done by concurrent transactions. So changing `<>` to `<` won't make a difference –  Nov 18 '16 at 08:50
  • behaviour you're referring to, _heavily_ depends on a "transaction isolation level" (which is mostly-never set to Serializable in practice - and it looks that even "RR" is not enough to make this completely safe as "phantom reads" possible under RR seem to cause potential trouble to this particular statement); hence - the '<' is safer – No-Bugs Hare Nov 19 '16 at 16:16
  • @No-BugsHare: no it doesn't, not with any modern DBMS. A single statement sees a consistent view of all involved tables as long as the statement runs. It will never see any concurrent modifications. A **single** statement can't have phantom reads or non-repeatable reads. The only thing this statement does not take care of is when new "latest" rows are inserted while it runs - but as those will be invisible to the running statement changing `<>` to `<` won't change that as well –  Nov 19 '16 at 20:04
  • Transaction isolation is a performance-vs-isolation compromise, and is all about locks (or MVCC-snapshot-to-be-used); as a result, they work in terms of execution plans and accessed rows, and not in terms of statements. That is, if we're not speaking about MySQL+ISAM which doesn't even have compliant transactions, leave alone transaction isolation (and which puts table-level locks on per statement basis, crazy stuff; probably it indeed doesn't matter for MySQL+ISAM, but as the OP wasn't mentioning MySQL+ISAM - I still suggest to change it to '<'). – No-Bugs Hare Nov 21 '16 at 09:42
  • @No-BugsHare: sorry, that's plain wrong. [Quote from the Oracle manual](http://docs.oracle.com/database/121/CNCPT/consist.htm#GUID-6A252EAE-7900-47DC-BD6A-D5376A6B7608) "*Oracle Database always enforces statement-level read consistency, which guarantees that data returned by a single query is committed and consistent for a single point in time*" and for the default isolation level read committed that single point in time is: "*the time at which the statement was **opened***" –  Nov 21 '16 at 09:48
  • Ah, you're right about Oracle (and the same thing MIGHT stand for Postgres too - they're quite close with regards to transaction isolation). Still - while it can fly this way for an MVCC-based DB such as Oracle or Postgres, it is 100% not the case for MSSQL and DB/2 (both are lock-based by default, so they cannot possibly afford this kind of behaviour; from MSSQL doc: "Read uncommitted. Possible implementation: Transactions are not isolated from each other."). – No-Bugs Hare Nov 21 '16 at 13:29
  • Also - it is NOT required by the standard either (levels below "Serializable" are explicitly allowed to have "phantom reads"; Oracle exceeds standards requirements - which means that there is one more reason to use it, but it is still certainly not the only DB out there). Ergo, for an unspecified DB, '<' is still a better solution. – No-Bugs Hare Nov 21 '16 at 13:29
  • Firstly: the DBMS is not "unspecified". It's Oracle. Secondly: phantom reads as defined in the SQL standard only happen when a statement is run for the **second** time. Never within the execution of a **single** statement. –  Nov 21 '16 at 13:36
  • DBMS is NOT specified in the original question (tags and comments are very weak indicators to avoid confusion). So, you should write it in big letters: VALID FOR ORACLE ONLY to avoid confusion. As for the standard - DB/2 and MSSQL are supposedly compliant (hey, the standard was originally written to match DB/2 :-)), and their locks cannot prevent bad behaviour with your statement. – No-Bugs Hare Nov 21 '16 at 13:47
  • "And with SQL Server's snapshot isolation introduced with 2005 it will also behave the same way" - yeah, it will (and also will with Serializable in both MSSQL and DB/2); however, there are still other isolations levels out there - which work differently. This is my whole point from the very beginning: why use the construct which requires some specific isolation levels when there is a construct (just 1 symbol less) which doesn't have this restriction? – No-Bugs Hare Nov 21 '16 at 14:05
  • Yeah but why NOT use – Infin8Loop Jun 18 '18 at 13:46
8

You can use row_number() (or rank() or dense_rank(), or even just the rownum pseudocolumn) to apply an order to the records, and then use that order to decide which to discard. In this case, ordering by applicationdatetime desc gives the application with the most recent date for each student the rank of 1:

select studentid, applicationid from (
    select studentid, applicationid,
        row_number() over (partition by studentid
            order by applicationdatetime desc) as rn
    from application
)
where rn = 1;

 STUDENTID APPLICATIONID
---------- -------------
         1         20002
         2         20005

You can then delete anything with a rank higher than 1, which will preserve the records you care about:

delete from application
where (studentid, applicationid) in (
    select studentid, applicationid from (
        select studentid, applicationid,
            row_number() over (partition by studentid
                order by applicationdatetime desc) as rn
        from application
    )
    where rn > 1
);

3 rows deleted.
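The same row_number() approach can be sketched end-to-end like this, using SQLite via Python's sqlite3 module rather than Oracle, purely for illustration (SQLite 3.25+ supports window functions and row-value IN; table and column names come from the question):

```python
# Illustrative demo of the row_number() DELETE on SQLite (not Oracle).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE application (
                    StudentID INTEGER,
                    ApplicationID INTEGER,
                    ApplicationDateTime TEXT)""")
conn.executemany("INSERT INTO application VALUES (?, ?, ?)",
                 [(1, 20001, "2011-04-12"), (1, 20002, "2011-05-15"),
                  (2, 20003, "2011-02-02"), (2, 20004, "2011-03-13"),
                  (2, 20005, "2011-06-05")])

# Rank each student's applications newest-first, then delete everything
# ranked below 1
cur = conn.execute("""DELETE FROM application
                      WHERE (StudentID, ApplicationID) IN (
                          SELECT StudentID, ApplicationID FROM (
                              SELECT StudentID, ApplicationID,
                                     row_number() OVER (PARTITION BY StudentID
                                         ORDER BY ApplicationDateTime DESC) AS rn
                              FROM application)
                          WHERE rn > 1)""")
print(cur.rowcount)  # 3
print(conn.execute("SELECT StudentID, ApplicationID FROM application "
                   "ORDER BY StudentID").fetchall())
# [(1, 20002), (2, 20005)]
```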
Alex Poole
1

You can do it like this:

DELETE FROM application
WHERE (StudentID, ApplicationID) NOT IN (SELECT StudentID
                                               ,MAX(ApplicationID)
                                         FROM application
                                         GROUP BY StudentID);

Note that this keeps the row with the highest ApplicationID per student, which only matches the most recent application if IDs are assigned in date order.

Alternatively, you can create a backup table, delete all records from the original table, and then re-insert from the backup only the rows you want to keep (the ones with the max values).
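The backup-table route can be sketched like this, using SQLite via Python's sqlite3 module for illustration only. Note that the GROUP BY with a bare ApplicationDateTime column relies on SQLite-specific behavior (the non-aggregated column is taken from the row holding the MAX), so this is an assumption and not portable SQL:

```python
# Illustrative backup-table approach on SQLite; not portable SQL.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE application (
                    StudentID INTEGER,
                    ApplicationID INTEGER,
                    ApplicationDateTime TEXT)""")
conn.executemany("INSERT INTO application VALUES (?, ?, ?)",
                 [(1, 20001, "2011-04-12"), (1, 20002, "2011-05-15"),
                  (2, 20003, "2011-02-02"), (2, 20004, "2011-03-13"),
                  (2, 20005, "2011-06-05")])

# 1. back up the table, 2. empty the original, 3. re-insert only the
# row with the highest ApplicationID per student
conn.execute("CREATE TABLE application_backup AS SELECT * FROM application")
conn.execute("DELETE FROM application")
conn.execute("""INSERT INTO application
                SELECT StudentID, MAX(ApplicationID), ApplicationDateTime
                FROM application_backup
                GROUP BY StudentID""")

print(conn.execute("SELECT * FROM application ORDER BY StudentID").fetchall())
# [(1, 20002, '2011-05-15'), (2, 20005, '2011-06-05')]
```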

Juan Carlos Oropeza
Tvitmsvleli