
I am looking to design a database that keeps track of every set of changes so that I can refer back to them in the future. So for example:

Database A 

+======+========+==========+
|  ID  |  Name  | Property |
|  1   |  Kyle  |    30    |

If I change the row's 'property' field to 50, it should update the row to:

1    Kyle    50

But it should save the fact that the row's property was 30 at some point in time. Then if the row is again updated to 70:

1    Kyle    70

Both facts that the row's property was 30 and then 50 should be preserved, such that with some query I could retrieve:

1    Kyle    30
1    Kyle    50

It should recognize that these were the "same entries" just at different points in time.

Edit: This history will need to be presented to the user at some point, so ideally there should be an understanding of which rows belong to the same "revision cluster".

What is the best way to approach the design of this database?

Delos Chang
  • Does your application need to understand the history (i.e. present this history to the end user), or is it for auditing purposes? – Matthew Jun 12 '13 at 21:23
  • Is it a requirement that this is stored in the DB? Usually this is done at the application so it can be in version control and be applied between multiple developers. – Daniel Kaplan Jun 12 '13 at 21:23
  • Yes, the application would need to present this history to the user. – Delos Chang Jun 12 '13 at 21:27
  • You might want to use a blockchain in this case, as it preserves the change history of the underlying database and allows you to navigate to previous commits - useful for establishing an audit trail. – morpheus Aug 27 '19 at 00:28

4 Answers


One way is to have a MyTableNameHistory table for every table in your database, with a schema identical to that of MyTableName, except that the History table's Primary Key includes one additional DateTime column named effectiveUtc. For example, if you have a table named Employee,

Create Table Employee
(
  employeeId integer Primary Key Not Null,
  firstName varChar(20) Null,
  lastName varChar(30) Not Null,
  HireDate smallDateTime Null,
  DepartmentId integer Null
)

Then the History table would be

Create Table EmployeeHistory
(
  employeeId integer Not Null,
  effectiveUtc DateTime Not Null,
  firstName varChar(20) Null,
  lastName varChar(30) Not Null,
  HireDate smallDateTime Null,
  DepartmentId integer Null,
  Primary Key (employeeId, effectiveUtc)
)

Then you can put a trigger on the Employee table, so that every time you insert, update, or delete anything in Employee, a new record is inserted into the EmployeeHistory table with the exact same values for all the regular fields and the current UTC datetime in the effectiveUtc column.
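For example, a minimal sketch of such a trigger in T-SQL (assumed here because of the smallDateTime column type above; the trigger name is illustrative, and deletes would need a companion trigger reading the deleted pseudo-table instead):

Create Trigger trgEmployeeHistory
On Employee
After Insert, Update
As
Begin
  -- copy every affected row into the history table,
  -- stamped with the current UTC time
  Insert Into EmployeeHistory
      (employeeId, effectiveUtc, firstName, lastName, HireDate, DepartmentId)
  Select employeeId, GetUtcDate(), firstName, lastName, HireDate, DepartmentId
  From inserted
End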

Then, to find the values at any point in the past, you just select the record from the history table whose effectiveUtc value is the latest one prior to the as-of datetime you want the values as of:

 Select * from EmployeeHistory h
 Where EmployeeId = @EmployeeId
   And effectiveUtc =
    (Select Max(effectiveUtc)
     From EmployeeHistory
     Where EmployeeId = h.EmployeeId
        And effectiveUtc < @AsOfUtcDate)
Charles Bretana
  • But... the query is inefficient and the schema doesn't allow you to readily get OP's revision cluster (as in using easily writable queries), i.e. to know when a department appeared, when it was deleted, and when it was recreated... with all of the employees within it at time `t`. – Denis de Bernardy Jun 12 '13 at 21:44
  • The query can be simplified using the postgres `DISTINCT ON ()` clause. Something like `Select distinct on (EmployeeId) * from EmployeeHistory where effectiveUtc <= @AsOfUtcDate order by EmployeeId, effectiveUtc desc` – Ihor Romanchenko Jun 13 '13 at 09:37
  • Why doesn't the History table just become the normal table? Then we can always just have one table and still have all our data – Worthy7 Jan 19 '17 at 01:01
  • Because the query to extract the latest record is inefficient, and if that query is used very frequently, and much more frequently than queries for historical data, then it makes sense to separate the current values into their own (much smaller) table. – Charles Bretana Jan 19 '17 at 16:36
  • If you don't want to manage a History table for every other table in your database, you can try using EAV. See [my answer](http://stackoverflow.com/a/42589834/2479481) – Luke Willis Mar 03 '17 at 23:01

To add onto Charles' answer, I would use an Entity-Attribute-Value model instead of creating a different history table for every other table in your database.

Basically, you would create one History table like so:

Create Table History
(
  tableId varChar(64) Not Null,
  recordId varChar(64) Not Null,
  changedAttribute varChar(64) Not Null,
  newValue varChar(64) Not Null,
  effectiveUtc DateTime Not Null,
  Primary Key (tableId, recordId, changedAttribute, effectiveUtc)
)

Then you would create a History record any time you create or modify data in one of your tables.

To follow your example, when you add 'Kyle' to your Employee table, you would create two records (one for each non-id attribute), and then you would create a new record every time a property changes:

History 
+==========+==========+==================+==========+==============+
| tableId  | recordId | changedAttribute | newValue | effectiveUtc |
| Employee | 1        | Name             | Kyle     | N            |
| Employee | 1        | Property         | 30       | N            |
| Employee | 1        | Property         | 50       | N+1          |
| Employee | 1        | Property         | 70       | N+2          |
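For illustration, the third change above could be recorded like this (a hypothetical sketch; GetUtcDate() is SQL Server syntax, so substitute your database's equivalent):

-- Kyle's Property changes from 50 to 70: log one History record
Insert Into History (tableId, recordId, changedAttribute, newValue, effectiveUtc)
Values ('Employee', '1', 'Property', '70', GetUtcDate())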

Alternatively, as a_horse_with_no_name suggested in this comment, if you don't want to store a new History record for every field change, you can store grouped changes (such as changing Name to 'Kyle' and Property to 30 in the same update) as a single record. In this case, you would express the collection of changes in JSON or some other blob format, merging the changedAttribute and newValue fields into one (changedValues). For example:

History 
+==========+==========+================================+==============+
| tableId  | recordId | changedValues                  | effectiveUtc |
| Employee | 1        | { Name: 'Kyle', Property: 30 } | N            |
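Writing such a grouped change might look like this (a hypothetical sketch assuming changedValues is stored as a JSON string; again, substitute your database's timestamp function):

-- one History record captures the whole update as a JSON document
Insert Into History (tableId, recordId, changedValues, effectiveUtc)
Values ('Employee', '1', '{"Name": "Kyle", "Property": 30}', GetUtcDate())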

This is perhaps more difficult than creating a History table for every other table in your database, but it has multiple benefits:

  • adding new fields to tables in your database won't require adding the same fields to another table
  • fewer tables are used
  • it's easier to correlate updates to different tables over time

One architectural benefit of this design is that you are decoupling the concerns of your app and your history/audit capabilities. This design would work just as well as a microservice using a relational or even NoSQL database that is separate from your application database.

Luke Willis
  • It's probably more efficient to store all row values in a single JSON or hstore column instead of one row for each modified column, e.g. following the pattern used in various audit triggers; see: http://okbob.blogspot.de/2015/01/most-simply-implementation-of-history.html or http://8kb.co.uk/blog/2015/01/19/copying-pavel-stehules-simple-history-table-but-with-the-jsonb-type/ or http://cjauvin.blogspot.de/2013/05/impossibly-lean-audit-system-for.html – a_horse_with_no_name Mar 03 '17 at 23:06
  • @a_horse_with_no_name Yep. That would definitely work too. I'll add a note explaining that option as well. Thanks! – Luke Willis Mar 24 '17 at 17:23
  • Shouldn't the History.newValue column be named History.oldValue? It's a history table, so the values it records should be old ones; the main table keeps the newValue (the current one). Correct me if I'm wrong. – Marecky Jun 22 '18 at 22:22
  • I was about to prepare my first EAV table for my 30-column table when I realized I had fought for many weeks to keep proper data types. EAV loses them irretrievably, and saving the data type in an extra column of the History table complicates the matter immensely. – Marecky Jun 22 '18 at 22:30
  • 2
    @Marecky you can do either old or new value. I like to use new value and initialize the table with a seed record that contains the current value at time of initialization. You can also just record old value whenever a change is made, and rely on the actual table's data for current. – Luke Willis Jun 26 '18 at 00:24
  • @Marecky for type changes, I don't like tracking those in my history table. I consider that a schema change and would just let my version control capture that. – Luke Willis Jun 26 '18 at 00:25
  • @LukeMWillis Yes, I went the historical way (keeping only old values) and I managed to keep type by preparing separate EAV tables for every data type. – Marecky Jun 26 '18 at 12:30
  • Doesn’t this method work only if all attributes that might change are character strings, here varChar(64)? If some attributes that might change have a different type, would you have separate type-specific history tables? – Jim Ratliff Sep 11 '22 at 06:20
  • 1
    @JimRatliff if the type of the changed column isn't varchar you would still store the data as a string. If you need to respect types, this won't work. This approach is best if you're only using this as a record/log of history. – Luke Willis Sep 13 '22 at 22:35

The best way depends on what you're doing. You want to look more deeply into slowly changing dimensions:

https://en.wikipedia.org/wiki/Slowly_changing_dimension

In Postgres 9.2+, don't miss the tsrange type either. It lets you merge start_date and end_date into a single column, and index that column with a GiST (or GIN) index alongside an exclusion constraint to prevent overlapping date ranges.
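A minimal sketch of that idea (table and column names are illustrative; the btree_gist extension is needed so the plain integer column can take part in the exclusion constraint):

Create Extension If Not Exists btree_gist;

Create Table EmployeeHistory
(
  employeeId integer Not Null,
  name       text,
  property   integer,
  validity   tsrange Not Null,
  -- no two rows for the same employee may have overlapping ranges
  Exclude Using gist (employeeId With =, validity With &&)
);

-- rows as they were at a given point in time
Select * From EmployeeHistory
Where validity @> timestamp '2013-06-12 00:00';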


Edit:

there should be an understanding of which rows belong to the same "revision cluster"

In this case you want date ranges one way or another in your tables, rather than revision numbers or live flags; otherwise you'll end up duplicating related data all over the place.

On a separate note, consider separating the audit tables from the live data rather than storing everything in the same table. It's harder to implement and manage, but it makes for far more efficient queries on the live data.


See this related post, too: Temporal database design, with a twist (live vs draft rows)

Denis de Bernardy

One way to log all the changes is to create so-called audit triggers. Such triggers can log any change to the table they are attached to into a separate log table, which can then be queried to see the history of the changes.

Details on the implementation here.
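As a rough sketch of the idea in PostgreSQL (all names are illustrative; a production version would capture more context, as the linked article shows):

Create Table AuditLog
(
  tableName text Not Null,
  operation text Not Null,
  rowData   json,
  changedAt timestamp Not Null Default now()
);

Create Function LogChange() Returns trigger As $$
Begin
  If TG_OP = 'DELETE' Then
    Insert Into AuditLog (tableName, operation, rowData)
    Values (TG_TABLE_NAME, TG_OP, row_to_json(Old));
  Else
    Insert Into AuditLog (tableName, operation, rowData)
    Values (TG_TABLE_NAME, TG_OP, row_to_json(New));
  End If;
  Return Null; -- the return value of an After row trigger is ignored
End;
$$ Language plpgsql;

Create Trigger EmployeeAudit
After Insert Or Update Or Delete On Employee
For Each Row Execute Procedure LogChange();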

Ihor Romanchenko