Duplication of data in a database versus application design

Question

I have an application design question concerning handling data sets in certain situations.

Let's say I have an application where I use some entities. We have an Order, containing information about the client, deadline, etc. Then we have Service entity having one to many relation with an Order. Service contains it's name. Besides that, we have a Rule entity, that sets some rules concerning what to deduct from the material stock. It has one to many relation with Service entity.

Now, my question is: How to handle situation, when I create an Order, and I persist it to the database, with it's relations, but at the same time, I don't want the changes made to entities that happen to be in a relation with the generated order visible. I need to treat the Order and the data associated with it as some kind of a log, so that removing a service from the table, or changing a set of rules, is not changing already generated orders, services, and rules that were used during the process.

Normally, how I would handle that, would be duplicating Services and Rules, and inserting it into new table, so that data would be independent from the one that is used during Order generation. Order would simply point to the duplicated data, instead of the original one, which would fix my problem. But that's data duplication, and as I think, it's not the best way to do it.

So, if you understood my question, do you know any better idea for solving that kind of a problem? I'm sorry if what I wrote doesn't make any sense. Just tell me, and I'll try to express myself in a better way.

Sounds like your order wants a snapshot of the values of fields elsewhere at the point at which the order's placed. Your Services and Rules tables could have revision number columns so the orders reference the revision as well as the data, and when your UI thinks it's changing rules or services it's actually adding new records with higher revision numbers? — Craig Graham, Jan 27 '14 at 17:10
It's a very interesting way of handling the problem. The only downside of that would be the fact, that you'd have to clean up the database from entities that are not referenced anymore. — R3m4ke, Jan 27 '14 at 17:28
Or, if these changes are manual, you keep them around indefinitely and they become part of your configuration management (with additional columns to track who, when and why) — Craig Graham, Jan 27 '14 at 18:13
I think you have to copy the information. The price tag displayed in the store is dynamic but the price displayed on the invoice doesn't change. You don't need to copy everything, just what is needed. An other option would be to keep all historical changes and store a AsOf date "This order uses the information AsOf this date". — the_lotus, Jan 27 '14 at 18:41
Well, everything depends on how much flexibility I want. Thank you both for answers. You've made some really interesting suggestions. — R3m4ke, Jan 27 '14 at 20:39
You may be interested in: [Database optimization orders](http://stackoverflow.com/a/11930467/533120). — Branko Dimitrijevic, Jan 29 '14 at 16:08

score 0 · Answer 1 · answered Jan 28 '14 at 15:59

I've been looking into the same case resently, so I'd like to share some thoughts.

The idea is to treat each entity, that requires versioning, as an object and store in the database object's instances. Say, for service entity this could be presented like:

service table, that contains only service_id column, PrimaryKey;
service_state (or ..._instance) table, that contains:
- service_id, Foreign Key to the service.service_id;
- state_start_dt, a moment in time when this state becomes active, NOT NULL;
- state_end_dt, a moment in time when this state is obsoleted, NULLable;
- all the real attributes of the service;
- Primary Key is service_id + state_start_dt.
for sure, state_start_dt::state_end_dt ranges cannot overlap, should be constrained.

What's good in such approach?

You have a full history of state transitions of your essential objects;
You can query system as it was at the given point in time;
Delivery of new configuration can be done in advance by inserting an appropriate record(s) with desired state_start_dt stamps;
Change auditing is integrated into the design (well, a couple of extra columns are required for a comlpete tracing).

What's wrong?

There will be data duplication. To reduce it make sure to split up the instantiating relations. Like: do not create a single table for customer data, create a bunch of those for credentials, addresses, contacts, financial information, etc.
The real Primary Key is service.service_id, while information is kept in a subordinate table service_state. This can lead to situation, when your service exists, while somebody had (intentionally or by mistake) removed all service_state records.
It's difficult to decide at which point in time it is safe to remove state records into the offline archive, for as long as there are entities in the system that reference service, one should check their effective dates prior to removing any state records.
Due to #3, one cannot just delete records from the service_state. In fact, it is also wrong to rely on the state_end_dt column, for service may have been active for a while and then suppressed. And querying service during moment when it was active should indicate service as active. Therefore, status column is required.

I think, that keeping in mind this approach downsides, it is quite nice. Though I'd like to hear some comments from the Relational Model perspective — especially on the drawbacks of such design.

I wouldn't say it's over complicating the problem, but it's definitely complex solution, which requires some thinking before actual implementation. It would be useful if we'd need to use that story line of changes. Which here fits the case. Although what I often encounter is the fact, that aside from the data duplication, I have to add one additional field of information after the generation process of Order. Here, we could solve that with additional table, that has one to one relation with service_state. With regular data duplication, I just throw into the second table, with Order data. — R3m4ke, Jan 29 '14 at 10:44

score 0 · Answer 2 · answered Jan 28 '14 at 16:13

I would recommend just duplicating the data in separate snapshot table(s). You could certainly use versioning schemes on the main table(s), but I would question how much additional complexity results in the effort to reduce duplicate data. I find that extra complexity in the data model results in a system that is much harder to extend. I would consider duplicate data to be the lesser of 2 evils here.

Duplication of data in a database versus application design

2 Answers2

What's good in such approach?

What's wrong?