
In at least one application, I have the need to keep old versions of records in a relational database. When something should be updated, a new copy would be added instead and the old row would be marked as not current. When something should be deleted, it should instead be marked as not current or deleted.

There is a simple use case of this: new versions of a record can only be added at the current time, each superseding one row. This can be used for archiving previous records when saving new data. For this, I'd add the following columns to each table:

VersionTime datetime -- Time when this version becomes effective
IsCurrent bool -- Indicates whether this version is the most current (and not deleted)

This is good if you only need to know what the most current version of a record is, and only enumerate previous versions of a single record separately. Point-in-time queries are even more painful than with the second variant below.
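To make the write path of this first variant concrete, here is a minimal sketch in Python with SQLite; the `settings` table, its columns' values, and the `save` helper are made up for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE settings (
        Id          INTEGER,  -- logical record id
        Value       TEXT,
        VersionTime TEXT,     -- time when this version becomes effective
        IsCurrent   INTEGER   -- 1 = most current (and not deleted)
    )""")

def save(con, rec_id, value, now):
    """Archive the old version and insert the new one in one transaction."""
    with con:
        con.execute("UPDATE settings SET IsCurrent = 0 "
                    "WHERE Id = ? AND IsCurrent = 1", (rec_id,))
        con.execute("INSERT INTO settings VALUES (?, ?, ?, 1)",
                    (rec_id, value, now))

save(con, 1, "red",  "2014-01-01T10:00:00")
save(con, 1, "blue", "2014-01-02T10:00:00")

# The current version is just a filter on IsCurrent.
row = con.execute("SELECT Value FROM settings "
                  "WHERE Id = 1 AND IsCurrent = 1").fetchone()
print(row[0])  # blue
```

Deleting works the same way, minus the insert: mark the current row as not current.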

A more generic variant is this: versions of records can be added at any time for any specified validity time range. So I could declare that some setting of an entity is valid until the end of 2013, that another version of it is valid in 2014, and that yet another version will be valid from 2015 on. This can be used both to archive old data (as above) and to plan ahead, using different data at some time in the future (while keeping this information as an archive). For this, I'd add the following columns to each table:

ValidFrom datetime -- Time when this version becomes valid (inclusive)
ValidTo datetime -- Time when this version becomes invalid (exclusive)

The second approach can basically represent the first as well, but it's harder to know which version is the most recent, because versions can also be added for the future. Also, the ValidFrom/ValidTo design allows overlapping ranges; by definition, the row with the highest ValidFrom shall apply in that case.
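The "highest ValidFrom wins" rule can be pushed into the query itself with a correlated subquery. A sketch in Python with SQLite, using a hypothetical table `t1` keyed by a logical `Id` (not an existing schema):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t1 (Id INTEGER, Value TEXT, ValidFrom TEXT, ValidTo TEXT)")
con.executemany("INSERT INTO t1 VALUES (?, ?, ?, ?)", [
    (1, "old",     "2013-01-01", "2015-01-01"),
    (1, "planned", "2014-06-01", "2015-01-01"),  # overlaps the first row
])

# For each logical Id, keep only the valid row with the highest ValidFrom.
query = """
    SELECT Value FROM t1
    WHERE ValidFrom <= :time AND ValidTo > :time
      AND ValidFrom = (SELECT MAX(v.ValidFrom) FROM t1 v
                       WHERE v.Id = t1.Id
                         AND v.ValidFrom <= :time AND v.ValidTo > :time)
"""
print(con.execute(query, {"time": "2014-07-01"}).fetchone()[0])  # planned
```

At a time where only one version covers the range (e.g. 2013-06-01), the same query returns that single version, so no separate "flattening" pass is strictly required for reads.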

Now I'm wondering how to implement an efficient solution to manage and query such data. Normally you can just write any SQL queries with any kind of WHERE, GROUP BY and JOIN to get the records you want. But with versioning applied, you need to consider the correct version of each record. So instead of joining every version of a record from another table, an appropriate condition must be added to only select the version that is valid at a given time.

An example:

SELECT a, b, c
FROM t1

Must be changed to:

SELECT a, b, c
FROM t1
WHERE t1.ValidFrom <= :time AND t1.ValidTo > :time
ORDER BY t1.ValidFrom DESC  -- the highest ValidFrom wins among overlaps
LIMIT 1

More complex with a table join:

SELECT a, b, c
FROM t1
    LEFT JOIN t2 ON (t2.a = t1.a)

Must be changed to:

SELECT a, b, c
FROM t1
    LEFT JOIN t2 ON (t2.a = t1.a)
WHERE t1.ValidFrom <= :time AND t1.ValidTo > :time
    AND t2.ValidFrom <= :time AND t2.ValidTo > :time

This still doesn't handle selecting the right version of overlapping time spans. I could add some clean-up method that flattens out overlapping version time ranges, but I don't know how efficient that would be.

I'm seeking to create a class (in C# in my case) that provides methods to read and write such versioned records. Writing is relatively easy because the queries are simple and easy to control with transactions. But querying would require building an API that accepts every fragment of an SQL SELECT query and intelligently builds the SQL to execute from that. The query method should only accept one additional parameter that specifies the time to fetch the data for. Depending on each entity's validity range, a different version would be selected for each.
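One way to keep that API small is to skip full SQL parsing and let the caller name the versioned tables (or aliases) used in the query; the wrapper then appends the validity predicates itself. A rough sketch of the idea in Python; `versioned_query` and its parameters are hypothetical, not an existing API:

```python
def versioned_query(base_sql, versioned_tables, time_param=":time"):
    """Append 'valid at :time' conditions for every versioned table alias.

    Assumes each listed table carries ValidFrom/ValidTo columns and that
    base_sql either has no WHERE clause or one we can extend with AND.
    """
    conditions = [
        f"{t}.ValidFrom <= {time_param} AND {t}.ValidTo > {time_param}"
        for t in versioned_tables
    ]
    glue = " AND " if " WHERE " in base_sql.upper() else " WHERE "
    return base_sql + glue + " AND ".join(conditions)

sql = versioned_query(
    "SELECT a, b, c FROM t1 LEFT JOIN t2 ON (t2.a = t1.a)", ["t1", "t2"])
print(sql)
```

Note that for a LEFT JOIN, placing the right table's condition in the WHERE clause (as in the example above) silently filters out unmatched rows; to preserve outer-join semantics, that condition would have to go into the ON clause instead.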

These are basically my incomplete thoughts about versioning data and providing an API to manage it. Have you done such a thing before, and would you like to tell me what you think of it? Do you have another idea that worked well? Can you offer me any advice on how to implement this API? While I theoretically know how to do it, I think it's a lot of work and I can't estimate how well it will work.

Bob Wintemberg
ygoe
  • I can't pretend to fully understand your question ... why don't you have an audit table for each table that is populated with a date stamped record every time an update, insert or delete happens - and get your data from there? I guess I must be missing something. – Martin Smellworse Jan 27 '14 at 15:01
  • Do I understand you right: You mean duplicating the table and keep the most-recently written data in one, and all previous versions (with a date) in the other table? This would only cover the first, simpler scenario, not the second. Also, it doesn't say anything about executing a point-in-time query. – ygoe Jan 27 '14 at 15:09
  • 1
    See https://stackoverflow.com/questions/39281. – Marco Eckstein Nov 01 '20 at 16:30

4 Answers


If you need old data to be part of your business logic:

  • Save the latest version in the master table (insert and update; a delete just changes the status column).
  • Take a snapshot in a detail table when an update happens (before any update, a snapshot is created).

revision history

If old data is just a trace log of changes, then a simple audit table (as suggested in the comments) will do.

Mohsen Heydari

I know this is an old post, but I wanted to reply not only to provide a solution but also to exchange ideas with you and to discuss the most efficient approach to this important issue of versioning.

My idea is to create a table that contains 5 main versioning fields:

  • Serial (incremental number) is the real identifier and is used for joins
  • ID (self-foreign key) is equal to the Serial field's value when the record is first created
  • ValidFrom (date from which the record is active)
  • ValidTo (date at which the record became inactive) => will be NULL for the current version
  • IsCurrent (flag indicating that the record is active)

When updating a record:

  • Update the current row: set ValidTo to the current datetime and IsCurrent to false.

  • Insert a new record, incrementing the Serial field but keeping the very same ID as the updated record; ValidFrom will be the current datetime, ValidTo will be NULL and IsCurrent will be true.

When deleting a record:

  • Set ValidTo to the current time and IsCurrent to false.

This way you will not have problems with joins, as joining tables on the ID field will show you the full record history.

If you have FKs to a parent table, you probably want to remove the value of the FK field.
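A sketch of this Serial/ID flow in Python with SQLite; the `person` table and the helper functions are illustrative only:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE person (
        Serial    INTEGER PRIMARY KEY AUTOINCREMENT,  -- real identifier
        Id        INTEGER,   -- equals the Serial of the first version
        Name      TEXT,
        ValidFrom TEXT,
        ValidTo   TEXT,      -- NULL for the current version
        IsCurrent INTEGER
    )""")

def insert(con, name, now):
    cur = con.execute("INSERT INTO person (Id, Name, ValidFrom, ValidTo, IsCurrent) "
                      "VALUES (NULL, ?, ?, NULL, 1)", (name, now))
    # ID starts out equal to the new row's Serial.
    con.execute("UPDATE person SET Id = Serial WHERE Serial = ?", (cur.lastrowid,))
    return cur.lastrowid

def update(con, rec_id, name, now):
    # Close the current version, then add the new one under the same ID.
    con.execute("UPDATE person SET ValidTo = ?, IsCurrent = 0 "
                "WHERE Id = ? AND IsCurrent = 1", (now, rec_id))
    con.execute("INSERT INTO person (Id, Name, ValidFrom, ValidTo, IsCurrent) "
                "VALUES (?, ?, ?, NULL, 1)", (rec_id, name, now))

pid = insert(con, "Alice", "2014-01-01")
update(con, pid, "Alicia", "2014-06-01")

# Joining on Id returns the whole history; IsCurrent picks the live row.
rows = con.execute("SELECT Name, IsCurrent FROM person WHERE Id = ? "
                   "ORDER BY Serial", (pid,)).fetchall()
print(rows)  # [('Alice', 0), ('Alicia', 1)]
```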


I am working with SQL within Oracle products (Database 11g). We have a huge project and versioning is an essential part of it. Both approaches you mentioned are useful.
If your database supports triggers and you can use PL/SQL, you can separate old data with a small dose of effort. Create BEFORE UPDATE and BEFORE DELETE triggers, then store all the old data in a dedicated history table (with the date and type of the change - delete or update).

Assumption: all tables you want to version must have a primary key.

Pseudocode:

CREATE TRIGGER TRIGGER_ON_VERSIONED_TABLE
BEFORE UPDATE ON VERSIONED_TABLE
FOR EACH ROW
BEGIN
  INSERT INTO VERSIONED_TABLE_HISTORY_PART
  VALUES (:OLD.COLUMN_A, USER, SYSTIMESTAMP);
END;

If you want all historical data about one primary key, select from both the "production" table and the history table, filter on the key you want, and sort by timestamp (the active record effectively carries SYSTIMESTAMP). And if you want to see the state of a record at a point in time, select the first row whose date is higher than your target date in the history (or production) table.

For the BEFORE UPDATE trigger, look here.

If you have an existing solution
(i.e. your original DB model does not contain versioning parts)
and you want to create a versioned table, or you cannot use PL/SQL, use your approach 2. Our project at work (on an Oracle database) also uses this approach. Say we have a table with documents (in real life you would have a version identifier as the primary key for this table, but this is only to show the principle):

CREATE TABLE DOC(
    DOC_NAME    VARCHAR(10)
  , DOC_NOTE    VARCHAR(10)
  , VALID_FROM  TIMESTAMP
  , VALID_TO    TIMESTAMP
  , CONSTRAINT DOC_PK PRIMARY KEY(DOC_NAME, VALID_FROM)
);

INSERT INTO doc VALUES ('A', 'FIRST VER', systimestamp, date'2999-12-31');
INSERT INTO doc VALUES ('B', 'FIRST VER', systimestamp, date'2999-12-31');

You don't need a WHERE clause like this:

WHERE VALID_FROM <= :time AND VALID_TO > :time
ORDER BY VALID_FROM LIMIT 1

Because in a versioned table, only one version of a record is valid at any given time. So you need only this:

SELECT * FROM DOC 
WHERE SYSTIMESTAMP BETWEEN VALID_FROM AND VALID_TO;

This always returns only one row, and you can use any other date instead of SYSTIMESTAMP. But you cannot update records directly: first, you must close the old version's end timestamp (but as I see it, this is not a problem for you). So if I update document 'A', I do it this way:

UPDATE doc SET VALID_TO = systimestamp 
WHERE DOC_NAME='A' AND SYSTIMESTAMP BETWEEN VALID_FROM AND VALID_TO;
INSERT INTO doc VALUES ('A', 'SECOND VER', systimestamp, date'2999-12-31');

And you can use the same select as above again.

SELECT * FROM DOC WHERE :CUSTOM_DATE BETWEEN VALID_FROM AND VALID_TO;

Best practice is to also create ACTIVE and HISTORICAL views for each versioned table. The base table holds all the data, and every time you want the actual record you must write BETWEEN VALID_FROM AND VALID_TO. The better way is to create views:

CREATE VIEW DOC_ACTIVE 
AS SELECT * FROM DOC WHERE SYSTIMESTAMP BETWEEN VALID_FROM AND VALID_TO;

Or, if you need the old data:

CREATE VIEW DOC_INACTIVE 
AS SELECT * FROM DOC WHERE NOT SYSTIMESTAMP BETWEEN VALID_FROM AND VALID_TO;

Now, instead of your original SQL:

SELECT a, b, c FROM t1

you don't need a complicated structure; just change the table to the "active" view (like DOC_ACTIVE):

SELECT a, b, c FROM t1_VIEW

Please, look on this answer also: Versioning in SQL Tables - how to handle it?

I do not know whether you see the difference between a valid record and the valid "object" inside it. In our work project we don't have any overlapping valid ranges. For example, take the table with documents, with a composite primary key of document name and version number. We have document A (valid for the years 2010-2050) and it has 2 versions.

Document A, version 1 (2010-2020), record valid 2014-9999: VALID   (NEW)
Document A, version 2 (2021-2050), record valid 2014-9999: VALID   (NEW)

In version 1, the document is valid from 2010 to 2020 (object validity, not record validity) and is in some state P. This record is valid from 2014 to 9999.

In version 2, the document is valid from 2021 to 2050 (object validity, not record validity) and is in state Q. This record is again valid from 2014 to 9999.

Let's say it is 2016. You find a clerical error in both versions of the document. You create a new record version for both document versions, effective in the current year (2016). After all changes you have these document versions:

Document A, version 1 (2010-2020), record valid 2014-2015: INVALID   (UPDATED)
Document A, version 2 (2021-2050), record valid 2014-2015: INVALID   (UPDATED)
Document A, version 1 (2010-2020), record valid 2016-9999: VALID NOW (NEW)
Document A, version 2 (2021-2050), record valid 2016-9999: VALID NOW (NEW)

After this, in 2018, someone creates a new version of the document, valid only for the years 2021-2030. (That document version is valid in the future, but its record is valid today.) Now you must update the VALID version 2 and create version 3. Actual state:

Document A, version 1 (2010-2020), record valid 2014-2015: INVALID   (NO CHANGE)
Document A, version 2 (2021-2050), record valid 2014-2015: INVALID   (NO CHANGE)
Document A, version 1 (2010-2020), record valid 2016-9999: VALID NOW (NO CHANGE)
Document A, version 2 (2021-2050), record valid 2016-2018: INVALID   (UPDATED)
Document A, version 2 (2031-2050), record valid 2018-9999: VALID NOW (NEW)
Document A, version 3 (2021-2030), record valid 2018-9999: VALID NOW (NEW)

All these operations in our work project are done by PL/SQL code.
If you select the currently valid records in 2018, you get 3 rows: A1, A2, A3.
If you select the versions valid in 2015, you get only A1 (INVALID) and A2 (INVALID).

So you have the full history, even though the document has 3 valid versions at the same point in time (record validity), and object validity is kept separate. This is a really good approach and should cover all your requirements.

You can easily use BETWEEN in views even for columns with NULL (indicating open start/end values), like this:

CREATE VIEW DOC_ACTIVE AS
SELECT * FROM DOC 
 WHERE SYSTIMESTAMP BETWEEN NVL(VALID_FROM, SYSTIMESTAMP) 
                        AND NVL(VALID_TO, SYSTIMESTAMP);
Atiris
  • Yes, I have primary keys. Interesting link at the end. But your description does not handle overlapping valid ranges and future valid ranges of my second (more comprehensive) variant. Also, BETWEEN cannot handle NULL values for infinite start/end times, hence my two conditions for that. – ygoe Jan 27 '14 at 17:03
  • I updated my answer, because I needed to write more than will fit into this comment. – Atiris Jan 27 '14 at 17:57
  • Okay, that's a different definition of validity ranges than mine. I don't distinguish between data record and content. Take a set of rules for example. Until April, a certain set of rules is valid. After that, different rules shall be applied. And maybe in November only, a temporary ruleset is used. That's 3 versions of the ruleset. Also I'm a bit hesitant about using views in queries. I've made bad experiences with that in terms of performance. Oracle can't use keys from the source of a view which makes things slower than necessary. (But I'm not Oracle-specific here.) – ygoe Jan 27 '14 at 18:35
  • Oracle optimizes execution of the entire statement. So if you have a view `A_VIEW` as `SELECT X, Y FROM A_TAB WHERE Y>2000` and then call this view like `SELECT X FROM A_VIEW WHERE Y<2001`, Oracle evaluates this taking account of all the conditions. So it is the same as if you wrote `SELECT X FROM A_TAB WHERE Y>2000 AND Y<2001` .. until you use specific left joins or UNION in views :) and I think this is a universal approach for using views in SQL. .. ok, but I guess I helped a little – Atiris Jan 27 '14 at 18:58

I have worked with tracking versions of records, but never with overlapping ranges. However, I have experience selecting records under similar criteria. Here's a query that should do what you want.

select  *
from    t1
where   VersionId = (select top 1 VersionId
                     from   t1 as MostRecentlyValid
                     where  MostRecentlyValid.ValidFrom <= @AsOfDate
                            and (MostRecentlyValid.ValidTo >= @AsOfDate
                                 or MostRecentlyValid.ValidTo is null)
                            and t1.Id = MostRecentlyValid.Id
                     order by MostRecentlyValid.ValidFrom desc)

This assumes that ValidTo can also be null to indicate no end date. If ValidTo can't be null, you can remove the OR condition. This also assumes a record is valid through the end of the day of the ValidTo date. If the record becomes old at the beginning of the day of the ValidTo date, change >= to just >.

This worked for the handful of test records I tried it on, and I'm fairly certain it'll work for all cases.

As for efficiency, I'm not a SQL expert, so I really don't know if this is the most efficient solution.

To join to another table, you could do something like this:

select  *
from    (select *
         from   t1
         where  VersionId = (select top 1 VersionId
                             from   t1 as MostRecentlyValid
                             where  MostRecentlyValid.ValidFrom <= @AsOfDate
                                    and (MostRecentlyValid.ValidTo >= @AsOfDate
                                         or MostRecentlyValid.ValidTo is null)
                                    and t1.Id = MostRecentlyValid.Id
                             order by MostRecentlyValid.ValidFrom desc)) as SelectedRecords
        inner join t2
            on SelectedRecords.Id = t2.Id
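For reference, here is the same correlated-subquery technique translated to SQLite in Python (LIMIT 1 in place of T-SQL's TOP 1, and a named parameter in place of the literal dates); the table contents are made up:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t1 (VersionId INTEGER, Id INTEGER, Value TEXT, "
            "ValidFrom TEXT, ValidTo TEXT)")
con.executemany("INSERT INTO t1 VALUES (?, ?, ?, ?, ?)", [
    (1, 7, "v1", "2013-01-01", "2014-01-01"),
    (2, 7, "v2", "2014-01-01", None),  # open-ended current version
])

# Per logical Id, pick the most recently valid version as of :asof.
query = """
    SELECT Value FROM t1
    WHERE VersionId = (SELECT VersionId FROM t1 AS MostRecentlyValid
                       WHERE MostRecentlyValid.ValidFrom <= :asof
                         AND (MostRecentlyValid.ValidTo >= :asof
                              OR MostRecentlyValid.ValidTo IS NULL)
                         AND t1.Id = MostRecentlyValid.Id
                       ORDER BY MostRecentlyValid.ValidFrom DESC
                       LIMIT 1)
"""
print(con.execute(query, {"asof": "2014-06-01"}).fetchone()[0])  # v2
print(con.execute(query, {"asof": "2013-06-01"}).fetchone()[0])  # v1
```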
Skye MacMaster