Selecting most recent and specific version in each group of records, for multiple groups

Question

The problem:
I have a table that records data rows in foo. Each time the row is updated, a new row is inserted along with a revision number. The table looks like:

id  rev field
1   1   test1
2   1   fsdfs
3   1   jfds
1   2   test2

Note: the last record is a newer version of the first row.

Is there an efficient way to query for the latest version of a record and for a specific version of a record?

For instance, a query for rev=2 would return the 2, 3 and 4th row (not the replaced 1st row though) while a query for rev=1 yields those rows with rev <= 1 and in case of duplicated ids, the one with the higher revision number is chosen (record: 1, 2, 3).

I would not prefer to return the result in an iterative way.

MS SQL 2008 R2, but ideally I would want to generate a query that doesn't depend on anything sql server specifics. — orange, Feb 24 '12 at 12:51
I've just noticed your 'in case of duplicated ids' clauses - could you expand on this? Not sure what it means. — AakashM, Feb 24 '12 at 13:32
Possible duplicate question here. Please read the discussion there as is very educative: https://stackoverflow.com/questions/121387/fetch-the-row-which-has-the-max-value-for-a-column — pablete, Jun 11 '20 at 19:36

Tim · Accepted Answer · 2012-02-25T11:40:00.927

58

To get only latest revisions:

SELECT * from t t1
WHERE t1.rev = 
  (SELECT max(rev) FROM t t2 WHERE t2.id = t1.id)

To get a specific revision, in this case 1 (and if an item doesn't have the revision yet the next smallest revision):

SELECT * from foo t1
WHERE t1.rev = 
  (SELECT max(rev) 
   FROM foo t2 
   WHERE t2.id = t1.id
   AND t2.rev <= 1)

It might not be the most efficient way to do this, but right now I cannot figure a better way to do this.

edited Feb 25 '12 at 11:40

answered Feb 24 '12 at 12:34

Tim

2,051
17
36

Thanks for the reply. This works well. Why do you think it's inefficient? – orange Feb 24 '12 at 12:48
1

Because of the nested query. Basically for each row in t you have to do the second query. – Tim Feb 24 '12 at 12:50
8

@Tim - Not true. SQL is declarative not imperative. In this case SQL Server knows this pattern and the plan is surprisingly simple. [plan image](http://i.stack.imgur.com/MUb0m.jpg) – Martin Smith Feb 24 '12 at 13:45
How would you go about a particular version (which is in essence all rev <= xyz)? 'SELECT * from data t1 WHERE t1.rev <= 1 AND t1.rev = (SELECT max(rev) FROM data t2 WHERE t2.id = t1.id)' only returns row 2 and 3. – orange Feb 24 '12 at 21:07
Try to add the t1.revision <=1 to the subquery instead – Tim Feb 24 '12 at 23:37
I updated my answer and added a working example for revision <= 1. Sorry, had to be t2.rev, not t1.rev. – Tim Feb 25 '12 at 11:40
1

Just wanted to add that if you end up having performance issues with `(SELECT max(rev) ...` then a simple solution is to create a `t_latest` table that you update on insert into `t`, then getting all latest revisions is just a join between `t_latest` and `t`. Slightly more costly on insert, but significantly faster queries. On 10,000 entries, with 1000 revisions, on Mac M1 Postgres, select max approach took 2mins, with join it took 1.2s. – andre_b Oct 18 '21 at 16:43
@andre_b so, schema is like, $t(id, rev, field)$ and t_latest(id, max_rev). Right?? What if we have t_latest(id, max_rev, field)? In this case, we are not required to do the join. Am I missing anything? – Sarvesh Pandey Jul 23 '23 at 14:14

Tony O'Hagan · Answer 2 · 2022-08-13T07:44:34.640

Here's an alternative solution that incurs an update cost but is much more efficient for reading the latest data rows as it avoids computing MAX(rev). It also works when you're doing bulk updates of subsets of the table. I needed this pattern to ensure I could efficiently switch to a new data set that was updated via a long running batch update without any windows of time where we had partially updated data visible.

Aging

Replace the rev column with an age column
Create a view of the current latest data with filter: age = 0
To create a new version of your data ...
INSERT: new rows with age = -1 - This was my slow long running batch process.
UPDATE: UPDATE table-name SET age = age + 1 for all rows in the subset. This switches the view to the new latest data (age = 0) and also ages older data in a single transaction.
DELETE: rows having age > N in the subset - Optionally purge old data

Indexing

Create a composite index with age and then id so the view will be nice and fast and can also be used to look up by id. Although this key is effectively unique, its temporarily non-unique when you're ageing the rows (during UPDATE SET age=age+1) so you'll need to make it non-unique and ideally the clustered index. If you need to find all versions of a given id ordered by age, you may need an additional non-unique index on id then age.

Rollback

Finally ... Lets say you're having a bad day and the batch processing breaks. You can quickly revert to a previous data set version by running:

UPDATE table-name SET age = age - 1 -- Roll back a version
DELETE table-name WHERE age < 0 -- Clean up bad stuff

Existing Table

Suppose you have an existing table that now needs to support aging. You can use this pattern by first renaming the existing table, then add the age column and indexing and then create the view that includes the age = 0 condition with the same name as the original table name.

This strategy may or may not work depending on the nature of technology layers that depended on the original table but in many cases swapping a view for a table should drop in just fine.

Notes

I recommend naming the age column to RowAge in order to indicate this pattern is being used, since it's clearer that its a database related value and it complements SQL Server's RowVersion naming convention. It also won't conflict with a column or view that needs to return a person's age.

Unlike other solutions, this pattern works for non SQL Server databases.

If the subsets you're updating are very large then this might not be a good solution as your final transaction will update not just the current records but all past version of the records in this subset (which could even be the entire table!) so you may end up locking the table.

An interesting alternative. Can I just clarify this line "_UPDATE table-name SET age = age + 1 for all rows in the subset_" - what subset are you referring to, shouldn't it just be all rows fullstop? — RyanfaeScotland, Feb 20 '19 at 09:42
In my case I needed to apply aging to a dynamic subset of the table so I was not aging the entire table. If that's not be your scenario then you can remove the subset condition from your queries. — Tony O'Hagan, Feb 20 '19 at 11:27
Cheers Tony, I missed the line at the start "_It also works when you're doing bulk updates of subsets of the table_" as well which added to my bit of confusion. All makes sense now. — RyanfaeScotland, Feb 20 '19 at 12:38
Thanks for sharing this clever pattern, deserves more votes. — Christoffer Bubach, Jul 13 '19 at 16:47
If your fortunate enough to be using a DB engine that supports partial indexing, you might only wish to index the subset of rows having `age = 0` — Tony O'Hagan, Jul 20 '19 at 05:39
Be aware that performing a whole of table update is likely to lock the table for the period of the transaction. This might exclude this solution during "operational hours" if your table is very large and requires consistent performance. — Tony O'Hagan, May 19 '21 at 00:14
Not that this is specific to this particular answer, but how would you track the ID if you're using auto increment? Seems like you'd need another table just to keep track of the series ID. — Rei Miyasaka, Sep 28 '21 at 23:58
@ReiMiyasaka Any multi-version solution can't use auto increment in the same table so Yes you'd either use a separate table OR you'd compute NEW_ID = MAX(ID) +1 and retry the INSERT when it fails due to concurrent INSERT attempts returning the same NEW_ID. You may wish to have an additional ID column on the table that is unique to the row and autoincrements. — Tony O'Hagan, Oct 08 '21 at 23:23

score 11 · Answer 3 · answered Feb 24 '12 at 13:18

This is how I would do it. ROW_NUMBER() requires SQL Server 2005 or later

Sample data:

DECLARE @foo TABLE (
    id int,
    rev int,
    field nvarchar(10)
)

INSERT @foo VALUES
    ( 1, 1, 'test1' ),
    ( 2, 1, 'fdsfs' ),
    ( 3, 1, 'jfds' ),
    ( 1, 2, 'test2' )

The query:

DECLARE @desiredRev int

SET @desiredRev = 2

SELECT * FROM (
SELECT 
    id,
    rev,
    field,
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY rev DESC) rn
FROM @foo WHERE rev <= @desiredRev 
) numbered
WHERE rn = 1

The inner SELECT returns all relevant records, and within each id group (that's the PARTITION BY), computes the row number when ordered by descending rev.

The outer SELECT just selects the first member (so, the one with highest rev) from each id group.

Output when @desiredRev = 2 :

id          rev         field      rn
----------- ----------- ---------- --------------------
1           2           test2      1
2           1           fdsfs      1
3           1           jfds       1

Output when @desiredRev = 1 :

id          rev         field      rn
----------- ----------- ---------- --------------------
1           1           test1      1
2           1           fdsfs      1
3           1           jfds       1

Seems to be working fine. I like that you can select the revision number that easily. However, I would prefer a plain SQL query if possible. — orange, Feb 24 '12 at 22:47

Treb · Answer 4 · 2012-02-27T08:54:54.783

6

If you want all the latest revisions of each field, you can use

SELECT C.rev, C.fields FROM (
  SELECT MAX(A.rev) AS rev, A.id
  FROM yourtable A
  GROUP BY A.id) 
AS B
INNER JOIN yourtable C
ON B.id = C.id AND B.rev = C.rev

In the case of your example, that would return

 rev field
 1   fsdfs   
 1   jfds   
 2   test2

edited Feb 27 '12 at 08:54

answered Feb 24 '12 at 12:35

Treb

19,903
7
54
87

Cannot find either column "A" or the user-defined function or aggregate "A.MAX", or the name is ambiguous. This error message comes up, but it doesn't help me too that much... – orange Feb 24 '12 at 13:01
you do not need the A. before the MAX, just use max like this: MAX(A.rev) or max(rev) – Tim Feb 24 '12 at 13:07
Oops, my mistake! Corrected the query. – Treb Feb 24 '12 at 13:10
@user1230724: Arghh! You are right, of course. Corrected the query (again...) – Treb Feb 27 '12 at 08:55

score 3 · Answer 5 · edited Jul 24 '12 at 06:53

3

SELECT
  MaxRevs.id,
  revision.field
FROM
  (SELECT
     id,
     MAX(rev) AS MaxRev
   FROM revision
   GROUP BY id
  ) MaxRevs
  INNER JOIN revision 
    ON MaxRevs.id = revision.id AND MaxRevs.MaxRev = revision.rev

edited Jul 24 '12 at 06:53

bluish

26,356
27
122
180

answered Feb 24 '12 at 15:08

Pittsburgh DBA

6,672
2
39
68

To also select a specific revision: SELECT revision.* FROM (SELECT id, MAX(rev) AS MaxRev FROM revision WHERE rev <= 2 GROUP BY id ) MaxRevs INNER JOIN revision ON MaxRevs.id = revision.id AND MaxRevs.MaxRev = revision.rev – Mahe Mar 05 '18 at 07:58

score 2 · Answer 6 · edited Feb 24 '12 at 12:42

2

SELECT foo.* from foo 
left join foo as later 
on foo.id=later.id and later.rev>foo.rev 
where later.id is null;

edited Feb 24 '12 at 12:42

Treb

19,903
7
54
87

answered Feb 24 '12 at 12:38

crimaniak

123
8

score 1 · Answer 7 · edited May 18 '18 at 09:24

1

How about this?

select id, max(rev), field from foo group by id

For querying specific revision e.g. revision 1,

select id, max(rev), field from foo where rev <= 1 group by id

edited May 18 '18 at 09:24

answered May 18 '18 at 08:51

Joonhui Kim

11
1