22

I'm looking into implementing object-versioning with the added twist of needing to have both live and draft objects, and could use the insights from someone experience in this, as I'm beginning to wonder if it's even possible without potentially horrific hacks.

I'll break it down to posts with tags for the sake of the example, but my use-case is a bit more general (involving slowly changing dimensions - http://en.wikipedia.org/wiki/Slowly_changing_dimension).

Suppose you've a posts table, a tags table, and a post2tag table:

posts (
 id
)

tags (
 id
)

post2tag (
 post_id fkey posts(id),
 tag_id fkey tags(id)
)

I'm in need of a couple of things:

  1. Being able to show exactly how a post looked like at an arbitrary datetime, including for deleted rows.
  2. Keep track of who is editing what, for a complete audit trail.
  3. Needs a set of materialized views ("live" tables) for the sake of keeping referential integrity (i.e. logging should be transparent to the developers).
  4. Needs to be appropriately fast for live and the latest draft rows.
  5. Being able to have a draft post coexist with a live post.

I've been investigating various options. So far, the best I've come up with (without points #4/#5) looks a bit like the SCD type6-hybrid setup, but instead of having a current boolean there's a materialized view for the current row. For all intents and purposes, it looks like this:

posts (
 id pkey,
 public,
 created_at,
 updated_at,
 updated_by
)

post_revs (
 id,
 rev pkey,
 public,
 created_at,
 created_by,
 deleted_at
)

tags (
 id pkey,
 public,
 created_at,
 updated_at,
 updated_by
)


tag_revs (
 id,
 public,
 rev pkey,
 created_at,
 created_by,
 deleted_at
)

post2tag (
 post_id fkey posts(id),
 tag_id fkey tags(id),
 public,
 created_at,
 updated_at,
 updated_by
)

post2tag_revs (
 post_id,
 tag_id,
 post_rev fkey post_revs(rev), -- the rev when the relation started
 tag_rev fkey tag_revs(rev), -- the rev when the relation started
 public,
 created_at,
 created_by,
 deleted_at,
 pkey (post_rev, tag_rev)
)

I'm using pg_temporal to maintain indexes on period(created_at, deleted_at). And I keep the various tables in sync using triggers. Yada yada yada... I created the triggers that allow to cancel an edit to posts/tags in such a way that the draft gets stored into the revs without being published. It works great.

Except when I need to worry about draft-row related relations on post2tag. In that case, all hell breaks loose, and this hints to me that I've some kind of design problem in there. But I'm running out of ideas...

I've considered introducing data duplication (i.e. n post2tag rows introduced for each draft revision). This kind of works, but tends to be a lot slower than I'd like it to be.

I've considered introducing drafts tables for the "last draft", but this quickly tends to become very very ugly.

I've considered all sorts of flags...

So question: is there a generally accepted means of managing live vs non-live rows in a row-version controlled environment? And if not, what have you tried and been reasonably successful with?

jangorecki
  • 16,384
  • 4
  • 79
  • 160
Denis de Bernardy
  • 75,850
  • 13
  • 131
  • 154
  • 1
    How do I add a bounty to this? – Denis de Bernardy Jun 11 '11 at 20:21
  • you can't add bounty to any question. Particularly questions must be at least 2 days old to be eligible for a bounty. Find all the limitations here: http://stackoverflow.com/faq. – Alex Aza Jun 11 '11 at 20:24
  • For reference: [cross-posted here](http://dba.stackexchange.com/questions/3274/temporal-database-design-with-a-twist-live-vs-draft-rows) – Denis de Bernardy Jun 11 '11 at 20:48
  • How different is this from what one could expect to be handled with a source code control library? – dkretz Jun 12 '11 at 19:14
  • @le dormer: it's not much different. In a SC environment, you have a multiple branches and the various files within each one are tightly coupled. My hope is to that there is a means to avoid exponential duplication of information. The thing is, in an SC environment you only need to deal with a single "row" (the whole source), which can have multiple versions. In a DB you can have many... – Denis de Bernardy Jun 12 '11 at 23:10

5 Answers5

11

Anchor modeling is a nice way to implement a temporal dB -- see the Wikipedia article too. Takes some time to get used to, but work nice. There is an online modeling tool and if you load the supplied XML file [File -> Load Model from Local File] you should see something like this -- also use [Layout --> Togle Names].

enter image description here

The [Generate --> SQL Code] will produce DDL for tables, views and point-in-time functions. The code is quite long, so I am not posting it here. Check the code out -- you may need to modify it for your DB.

Here is the file to load into modeling tool.

<schema>
<knot mnemonic="EXP" descriptor="Expired" identity="smallint" dataRange="char(1)">
<identity generator="true"/>
<layout x="713.96" y="511.22" fixed="true"/>
</knot>
<anchor mnemonic="US" descriptor="User" identity="int">
<identity generator="true"/>
<attribute mnemonic="USN" descriptor="UserName" dataRange="varchar(32)">
<layout x="923.38" y="206.54" fixed="true"/>
</attribute>
<layout x="891.00" y="242.00" fixed="true"/>
</anchor>
<anchor mnemonic="PO" descriptor="Post" identity="int">
<identity generator="true"/>
<attribute mnemonic="TIT" descriptor="Title" dataRange="varchar(2)">
<layout x="828.00" y="562.00" fixed="true"/>
</attribute>
<layout x="855.00" y="471.00" fixed="true"/>
</anchor>
<anchor mnemonic="TG" descriptor="Tag" identity="int">
<identity generator="true"/>
<attribute mnemonic="TGT" descriptor="TagText" dataRange="varchar(32)">
<layout x="551.26" y="331.69" fixed="true"/>
</attribute>
<layout x="637.29" y="263.43" fixed="true"/>
</anchor>
<anchor mnemonic="BO" descriptor="Body" identity="int">
<identity generator="true"/>
<attribute mnemonic="BOT" descriptor="BodyText" dataRange="varchar(max)">
<layout x="1161.00" y="491.00" fixed="true"/>
</attribute>
<layout x="1052.00" y="465.00" fixed="true"/>
</anchor>
<tie timeRange="datetime">
<anchorRole role="IsTagged" type="PO" identifier="true"/>
<anchorRole role="IsAttached" type="TG" identifier="true"/>
<anchorRole role="BYAuthor" type="US" identifier="false"/>
<knotRole role="Until" type="EXP" identifier="false"/>
<layout x="722.00" y="397.00" fixed="true"/>
</tie>
<tie timeRange="datetime">
<anchorRole role="Contains" type="PO" identifier="true"/>
<anchorRole role="ContainedIn" type="BO" identifier="false"/>
<layout x="975.00" y="576.00" fixed="true"/>
</tie>
<tie>
<anchorRole role="CreatedBy" type="TG" identifier="true"/>
<anchorRole role="Author" type="US" identifier="false"/>
<layout x="755.10" y="195.17" fixed="true"/>
</tie>
<tie>
<anchorRole role="CreatedBy" type="PO" identifier="true"/>
<anchorRole role="Author" type="US" identifier="false"/>
<layout x="890.69" y="369.09" fixed="true"/>
</tie>
<tie>
<anchorRole role="ModifiedBy" type="BO" identifier="true"/>
<anchorRole role="Author" type="US" identifier="false"/>
<layout x="1061.81" y="322.34" fixed="true"/>
</tie>
</schema>
Damir Sudarevic
  • 21,891
  • 3
  • 47
  • 71
2

I've implemented a temporal database using SCD type 2 and PostgreSQL Rules and Triggers, and wrapped it in a self-contained package for ActiveRecord: http://github.com/ifad/chronomodel

The design is independent from the language / framework, though - you can create Rules and Triggers manually and the database will take care of the rest. Have a look at https://github.com/ifad/chronomodel/blob/master/README.sql.

Also efficient indexing and querying of temporal data using geometric operators is included as a bonus. :-)

vjt
  • 486
  • 4
  • 9
1

post2tag_revs has a problem in that it is trying to express 2 fundamentally different concepts.

A tag applied to a draft post revision only ever applies to that one revision, unless the revision is ever published.

Once a tag is published (i.e. associated with a published post revision), it applies to every future revision of the post until it is revoked.

And associating with a published revision, or unasociating, isn't necessarily simultaneous with a revision being published, unless you artificially enforce this by cloning a revision just so you can associate tag additions or removals...

I'd change the model by making post2tag_revs.post_rev only relevant for draft tags. Once the revision is published (and the tag is live), I'd use a time stamp column to mark the beginning and end of the published validity. You may or may not want a new post2tag_revs entry to represent this change.

As you point out, this makes this relationship bi-temporal. You might improve performance in the "normal" case by adding a boolean to post2tag to indicate that the tag is currently associated with the post.

Bill Michell
  • 8,240
  • 3
  • 28
  • 33
  • Thanks for risking an answer... But it's not irrelevant, actually. Post gets drafted and a tag is tripped, another is added, and a third is renamed (arguably a new tag), but the post is kept as a draft while the live one lives on -- these changes should not appear on the live row. It's a *tough* question. :-) – Denis de Bernardy Jun 11 '11 at 21:03
  • Manual (as in set by the post's author) – Denis de Bernardy Jun 11 '11 at 21:18
  • 2
    Right. So then draft tags are fundamentally different to published tags. Draft tags only ever apply to the draft revision of the post, unless the post has been published. Once published, the tags apply to every future revision of the post, at least until they are then removed. Either the data model or the business objects need to understand this fundamental difference. – Bill Michell Jun 11 '11 at 21:21
  • Yep, that pretty much sums it up. :-( – Denis de Bernardy Jun 11 '11 at 21:23
  • Btw, "Not possible" is perfectly acceptable an answer as I stand. I've been stuck on this for weeks, in spite of extensive research on the topic... Hence the open question ("what have you tried and been reasonably successful with?"). :-| – Denis de Bernardy Jun 11 '11 at 21:34
  • As I noted, "I've considered introducing drafts tables for the "last draft", but this quickly tends to become very very ugly." -- but I see why you thought of it; it's possibly the least ugly I've though of so far. :-| – Denis de Bernardy Jun 11 '11 at 21:46
  • Point is that a tag is associated with a post for a span of time that lasts from the instant a draft revision is published, or the instant the tag is attached to a live revision. You need to be recording the time the tag association was published, not particularly the revision that was live at the time; that information can be inferred if necessary. However, a draft tag needs to know which draft post revision it is associated with; it may never be published, and this is the only way you know which tags to publish when a post revision goes live. – Bill Michell Jun 11 '11 at 21:50
  • "However, a draft tag needs to know which draft post revision it is associated with" -- yeah, that pretty much sums the whole issue. The drafts need to keep the associated data, without the latter being necessarily being published/live. And on top of it all, they may never get published at all. Hence my "running out of ideas" and probing for ideas. (Thanks for your time btw; I'll vote every relevant answer for the effort in the end, as it's an extremely tricky and thorny question.) – Denis de Bernardy Jun 11 '11 at 21:54
  • So add published_at to post2tag_revs (or possibly post2tag) and only set this value once the tag is linked to a published revision. If empty, the tag has never been published, and so should be ignored by anything that only cares about published data. When a post_rev is published, you should set this field on all the post2tag_rev entries that are associated with the post_rev. – Bill Michell Jun 11 '11 at 22:00
  • So, sort of like [bitemporal tables](http://en.wikipedia.org/wiki/Temporal_database), right? :-| Not saying it's wrong (it isn't), just hoping there's a better/more efficient way, because my own tests showed they weren't a great option when it came to performance. :-( (Voting you up, btw. Thanks so much for the input...) – Denis de Bernardy Jun 11 '11 at 22:04
  • Yup. Didn't know they were called that, but yes, that was what I meant, at least for this specific relationship. You don't seem to need it for any of the rest of the data model. Oh, and if you want efficiency, you can just add a boolean to post2tag to indicate the relation is currently valid. – Bill Michell Jun 11 '11 at 22:07
  • Hehe, I've researched it somewhat before asking. ;-) Which is also why I'm not accepting your answer immediately, even though I think it is valid. I'm hoping someone from postgres/mysql/dba.SE will show up with an astute answer... – Denis de Bernardy Jun 11 '11 at 22:11
  • (Re: your update) In other words, it's a bit like adding an is_live_draft flag to the post2tag_revs table. It's precisely what I'm hoping to avoid (albeit possibly wrongly), because it adds an extra query on the insert/update trigger (yes, revisions also need to be transaction-aware). – Denis de Bernardy Jun 11 '11 at 22:34
  • No. I'd decorate post2tag to show whether the tag was associated with a post revision that had ever been published, and the tag had not since been revoked. Yes, it does add to what needs to be done during a write, but I think it would be essential for read performance. – Bill Michell Jun 11 '11 at 22:43
  • Yeah, but that contradicts "3. Needs a set of materialized views ("live" tables) for the sake of keeping referential integrity (i.e. logging should be transparent to the developers)." -- if the row is not live, it shouldn't be in post2tag altogether. In fact, if a post2tag is deleted, post2tag_revs contains (and currently does contains) the created_at/deleted_at where it appeared and disappeared. – Denis de Bernardy Jun 11 '11 at 22:47
  • The best approach I've come up with atm is to duplicate all relations (and sub- and sub-sub- and ...) on new drafts, but it's a) hugely gory when it comes to new drafts and b) well, hugely gory. :-P – Denis de Bernardy Jun 11 '11 at 22:51
  • Ah. Yes, I get it. The example domain confused me a little. Here, draft posts are simply posts that haven't been published yet, and a developer would need to be aware of their existence. You're trying to hide them from the developer in your domain, right? – Bill Michell Jun 11 '11 at 22:59
  • Pretty much, yeah. In actuality, I currently also have a draft fkey (rev) key that points to the latest draft, but it seemed like an implementation detail that might push answers in a direction that might not necessarily be valid so I silenced it. It allows to grab (assuming the draft revs have null, or as I'm currently doing equal created_at/deleted_at cols) a draft post's relations based on draft rev ids, but it's hugely gory when the tags changes (as in the case). :-| Again, the question is by no means simple (and thanks again for the invaluable input; I'm literally scanning for ideas...). – Denis de Bernardy Jun 11 '11 at 23:06
1

I think I nailed it. Basically, you add a (unique) draft field to the relevant tables, and you work on the drafts as if they were a new post/tag/etc.:

posts (
 id pkey,
 public,
 created_at stamptz,
 updated_at stamptz,
 updated_by int,
 draft int fkey posts (id) unique
)

post_revs (
 id,
 public,
 created_at,
 created_by,
 deleted_at,
 pkey (id, created_at)
)

tags (
 id pkey,
 public,
 created_at,
 updated_at,
 updated_by,
 draft fkey tags (id) unique
)


tag_revs (
 id,
 public,
 created_at,
 created_by,
 deleted_at,
 pkey (id, created_at)
)

post2tag (
 post_id fkey posts(id),
 tag_id fkey tags(id),
 public,
 created_at,
 updated_at,
 updated_by,
 pkey (post_id, tag_id)
)

post2tag_revs (
 post_id,
 tag_id,
 public,
 created_at,
 created_by,
 deleted_at,
 pkey (post_id, tag_id, created_at)
)
Denis de Bernardy
  • 75,850
  • 13
  • 131
  • 154
0

Use only 3 tables: posts, tags and post2tag.

Add start_time and end_time columns to all tables. Add unique index for key, start_time and end_time. Add unique index for key where end_time is null. Add trigers.

For current:

SELECT ... WHERE end_time IS NULL

At time:

WHERE (SELECT CASE WHEN end_time IS NULL
THEN (start_time <= at_time)
ELSE (start_time <= at_time AND end_time > at_time)
END)

Search of current data is not slow because of functional index.

Edit:

CREATE UNIQUE INDEX ... ON post2tag (post_id, tag_id) WHERE end_time IS NULL;
CREATE UNIQUE INDEX ... ON post2tag (post_id, tag_id, start_time, end_time);

FOREIGN KEY (post_id, start_time, end_time) REFERENCES posts (post_id, start_time, end_time) ON DELETE CASCADE ON UPDATE CASCADE;
FOREIGN KEY (tag_id, start_time, end_time) REFERENCES tags (tag_id, start_time, end_time) ON DELETE CASCADE ON UPDATE CASCADE;
jordani
  • 436
  • 3
  • 3
  • In the current set-up, the posts, tags and post2tag tables are merely materialized views. In so far as I've tested, they're needed to enforce a unique index on id, so as to allow for foreign keys. Also, there is (courtesy of pg_temporal) a `period` type, which allows to rewrite the above as where `period(start, stop) @> :datetime`. And in so far as I've investigated it, using a single table doesn't solve the draft vs live row issue. My current hunch is that I'm going to need a draft_id field in the three live tables. – Denis de Bernardy Jun 12 '11 at 09:19
  • If id is serial you need no index. You need no period type too. And you need no "materialized views". Versioning of post2tag is required. – jordani Jun 12 '11 at 18:29
  • Err... Serial does not make a primary key. You need to actually declare it as such, to get the index. And period type is indispensable, to set up 1) an exclude constraint using gist and 2) an efficient index use on a correlated columns. Thanks for the suggestion, though. :) – Denis de Bernardy Jun 12 '11 at 23:00