Database Historization

Question

We have a requirement in our application where we need to store references for later access.

Example: A user can commit an invoice at a time and all references(customer address, calculated amount of money, product descriptions) which this invoice contains and calculations should be stored over time.

We need to hold the references somehow but what if the e.g. the product name changes? So somehow we need to copy everything so its documented for later and not affected by changes in future. Even when products are deleted, they need to reviewed later when the invoice is stored.

What is the best practise here regarding database design? Even what is the most flexible approach e.g. when the user want to edit his invoice later and restore it from the db?

Thank you!

score 11 · Accepted Answer · answered Jun 11 '12 at 21:58

Here is one way to do it:

enter image description here

Essentially, we never modify or delete the existing data. We "modify" it by creating a new version. We "delete" it by setting the DELETED flag.

For example:

If product changes the price, we insert a new row into PRODUCT_VERSION while old orders are kept connected to the old PRODUCT_VERSION and the old price.
When buyer changes the address, we simply insert a new row in CUSTOMER_VERSION and link new orders to that, while keeping the old orders linked to the old version.
If product is deleted, we don't really delete it - we simply set the PRODUCT.DELETED flag, so all the orders historically made for that product stay in the database.
If customer is deleted (e.g. because (s)he requested to be unregistered), set the CUSTOMER.DELETED flag.

Caveats:

If product name needs to be unique, that can't be enforced declaratively in the model above. You'll either need to "promote" the NAME from PRODUCT_VERSION to PRODUCT, make it a key there and give-up ability to "evolve" product's name, or enforce uniqueness on only latest PRODUCT_VER (probably through triggers).
There is a potential problem with the customer's privacy. If a customer is deleted from the system, it may be desirable to physically remove its data from the database and just setting CUSTOMER.DELETED won't do that. If that's a concern, either blank-out the privacy-sensitive data in all the customer's versions, or alternatively disconnect existing orders from the real customer and reconnect them to a special "anonymous" customer, then physically delete all the customer versions.

This model uses a lot of identifying relationships. This leads to "fat" foreign keys and could be a bit of a storage problem since MySQL doesn't support leading-edge index compression (unlike, say, Oracle), but on the other hand InnoDB always clusters the data on PK and this clustering can be beneficial for performance. Also, JOINs are less necessary.

Equivalent model with non-identifying relationships and surrogate keys would look like this:

enter image description here

To have unique product names, you can add a table with only product names, where the name is the pk, and link to that table from PRODUCT_VERSION — Owe Jessen, Jan 26 '15 at 12:16
@OweJessen Sure you could have LATEST_PRODUCT_VERSION table with unique NAME, but that doesn't count as "declarative" solution, since you would need to manually insert and delete rows in that table as new product versions are created. Unless you are using a DBMS that can both automatically update materialized views and enforce uniqueness on them (such as MS SQL Server's indexed views), so the DBMS itself maintains LATEST_PRODUCT_VERSION for you. — Branko Dimitrijevic, Jan 26 '15 at 17:03

score 1 · Answer 2 · answered Jun 11 '12 at 19:55

1

You could add a column in the product table indicating whether or not it is being sold. Then when the product is "deleted" you just set the flag so that it is no longer available as a new product, but you retain the data for future lookups.

To deal with name changes, you should be using ID's to refer to products rather than using the name directly.

answered Jun 11 '12 at 19:55

user812786

4,302
5
38
50

That second point won't help him if he wants to historically include the product name as it was when it was sold. Say the same ItemId applies to "Coke" until 1988 and then to "Coke Classic" afterward, and he wants to know that it was called "Coke" on older orders. Your suggestion is what he's trying to avoid - any report he generates using normalized data will show "Coke Classic" on all orders, pre-1988 or not. – David Jun 11 '12 at 20:08
True .. one way to deal with that would be to create a new product for name changes, or add another table to track product names (which approaches @Branko Dimitrijevic's solution). – user812786 Jun 12 '12 at 12:55

David · Answer 3 · 2012-06-11T20:24:26.800

The problem you're facing is, as I'm sure you know, a result of Database Normalization. One of the approaches to resolve this can be taken from Business Intelligence techniques - archiving the data ina de-normalized state in a Data Warehouse.

Normalized data:

Orders table
- OrderId
- CustomerId
Customers Table
- CustomerId
- Firstname
- etc
Items table
- ItemId
- Itemname
- ItemPrice
OrderDetails Table
- ItemDetailId
- OrderId
- ItemId
- ItemQty
- etc

When queried and stored de-normalized, the data warehouse table looks like

OrderId
CustomerId
CustomerName
CustomerAddress
(other Customer Fields)
ItemDetailId
ItemId
ItemName
ItemPrice
(Other OrderDetail and Item Fields)

Typically, there is either some sort of scheduled job that pulls data from the normalized datas into the Data Warehouse on a scheduled basis, OR if your design allows, it could be done when an order reaches a certain status. (Such as shipped) It could be that the records are stored at each change of status (with a field called OrderStatus tacking the current status), so the fully de-normalized data is available for each step of the oprder/fulfillment process. When and how to archive the data into the warehouse will vary based on your needs.

There is a lot of overhead involved in the above, but the other common approach I'm aware of carries even MORE overhead.

The other approach would be to make the tables read-only. If a customer wants to change their address, you don't edit their existing address, you insert a new record.

So if my address is AddressId 12 when I first order on your site in Jamnuary, then I move on July 4, I get a new AddressId tied to my account. (Say AddressId 123123 because your site is very successful and has attracted a ton of customers.)

Orders I palced before July 4 would have AddressId 12 associated with them, and orders placed on or after July 4 have AddressId 123123.

Repeat that pattern with every table that needs to retain historical data.

I do have a third approach, but searching it is difficult. I use this in one app only, and it actually works out pretty well in this single instance, which had some pretty specific business needs for reconstructing the data exactly as it was at a specific point in time. I wouldn't use it unless I had similar business needs.

At a specific status, serialize the data into an Xml document, or some other document you can use to reconstruct the data. This allows you to save the data as it was at the time it was serialized, retaining original table structure and relaitons.

yes....what @David Stratton said...removing my more colloquial answer of the same idea. — GDP, Jun 11 '12 at 20:02
@Greg P - I was going to vote you up. Your answer is more concise, and still relevant. — David, Jun 11 '12 at 20:03
Added it back for posterity...don't like to be repetitious when there's a more concise answer, thanks though. — GDP, Jun 11 '12 at 20:07

score 1 · Answer 4 · answered Jun 11 '12 at 20:01

You've opened up an eternal debate between the purist and practical approach.

From a normalization standpoint of your database, you "should" keep all the relevant data. In other words, say a product name changes, save the date of the change so that you could go back in time and rebuild your invoice with that product name, and all other data as it existed that day.

A "de"normalized approach is to view that invoice as a "moment in time", recording in the relevant tables data as it actually was that day. This approach lets you pull up that invoice without any dependancies at all, but you could never recreate that invoice from scratch.

score 0 · Answer 5 · answered Jun 11 '12 at 20:50

When you have time-sensitive data, you use things like the product and Customer tables as lookup tables and store the information directly in your Orders/orderdetails tables.

So the order table might contain the customer name and address, the details woudl contain all relevant information about the produtct including especially price(you never want to rely on the product table for price information beyond the intial lookup at teh time of the order).

This is NOT denormalizing, the data changes over time but you need the historical value, so you must store it at the time the record is created or you will lose data intergrity. You don't want your financial reports to suddenly indicate you sold 30% more last year because you have price updates. That's not what you sold.

"This is NOT denormalizing..." That's right. In relational systems, duplicate data means "the same values with the same meaning". Here, the values might be the same, but the meaning is different. (Current price, for example, versus price at the time of the order.) — Mike Sherrill 'Cat Recall', Jun 11 '12 at 22:41

Database Historization

5 Answers5

Linked