
TL;DR: Is this design correct and how should I query it?

Let's say we have history tables for city and address designed like this:

CREATE TABLE city_history (
  id         BIGINT UNSIGNED NOT NULL PRIMARY KEY,
  name       VARCHAR(128)    NOT NULL,
  history_at DATETIME        NOT NULL,
  obj_id     INT UNSIGNED    NOT NULL
);
CREATE TABLE address_history (
  id           BIGINT UNSIGNED NOT NULL PRIMARY KEY,
  city_id      INT             NULL,
  building_no  VARCHAR(10)     NULL,
  history_at   DATETIME        NOT NULL,
  obj_id       INT UNSIGNED    NOT NULL
);

The original tables are pretty much the same, except for the history-specific columns history_at and obj_id (city: id, name; address: id, city_id, building_no). There's also a foreign key relation between address and city (city_id).

History tables are populated on every change of the original entry (create, update, delete) with the exact state of the entry at given time.

obj_id holds the id of the original object; there is no foreign key, because the original entry can be deleted while its history entries can't. history_at is the creation time of the history entry.

History entries are created for every table independently - change in city name creates city_history entry but does not create address_history entry.

  1. So to see the state of the whole address with its city (e.g. on printed documents) at any point in time T1, we take from both history tables the most recent entries for the given obj_id created before T1, right? With this design we should in theory be able to see the state of a single address with city at any given point in time. Could anyone help me create such a query for a given address id and time? Please note that there could be multiple records with the same exact timestamp.

  2. There is also a need to create a report showing every change of state of a given address in a given time period, with entries like "city_name, building_no, changed_at". Is that something that can be created with an SQL query? Performance doesn't matter so much here; such reports won't be generated often.

  3. The above report will probably be needed in an interactive version where user can filter results e.g. by city name or building number. Is it still possible to do in SQL?

  4. In reality address table and address_history table have 4 more foreign keys that should be joined in report (street, zip code, etc.). Wouldn't the query be like ten pages long to provide all the needed functionality?
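
For reference, here is a sketch of use case 1 under the original design, assuming MySQL; `:address_obj_id` and `:t1` are illustrative placeholder parameters. Ties on `history_at` are broken by the surrogate `id` (higher id = inserted later):

```sql
-- State of one address (with its city) as of time :t1.
-- "Most recent entry before :t1" per obj_id, ties broken by highest id.
SELECT ah.obj_id, ah.building_no, ch.name AS city_name
FROM address_history ah
JOIN city_history ch ON ch.obj_id = ah.city_id
WHERE ah.obj_id = :address_obj_id
  AND ah.id = (SELECT h.id FROM address_history h
               WHERE h.obj_id = :address_obj_id AND h.history_at <= :t1
               ORDER BY h.history_at DESC, h.id DESC LIMIT 1)
  AND ch.id = (SELECT h.id FROM city_history h
               WHERE h.obj_id = ah.city_id AND h.history_at <= :t1
               ORDER BY h.history_at DESC, h.id DESC LIMIT 1);
```

Since city_id is nullable, a LEFT JOIN (with the city subquery moved into the join condition) would be needed to also return addresses that had no city at T1.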

I've tried to build some queries and played with greatest-n-per-group, but I don't think I'm getting anywhere with this. Is this design really OK for my use cases (if so, can you please provide some queries for me to play with?), or should I rethink the whole design?

Any help appreciated.

webfreak
  • I find history tables function more easily with "valid/active" `from` and `to` datetimes. If your updates insert the current state into the history it is fairly easy: the previous newest just needs to be set to `to = now()` and the new newest gets `to = null`. That way, queries for specific times just need `somedate >= from and (to IS NULL OR somedate <= to)`, rather than being flooded with tons of "last record with 'starting' before, or first record with 'until' after, somedate" logic. – Uueerdo Jan 30 '18 at 22:19
  • @Uueerdo Wow, thanks, this is great. This way for any specific time there's only one active history record in each history table, which means I can join city, street or whatever history entry without any weird subqueries. As far as I can tell right now it works fine for my second use case too. Haven't tried adding any filtering yet but it looks promising. `From` and `to` could be automatically updated via triggers too, awesome! – webfreak Jan 30 '18 at 23:05
  • Yes, using triggers to maintain the history is how the general pattern usually plays out. The non-history table usually ends up being relegated to more of a "current/working configuration" table, since the "history" includes even the current active version of the data. One thing to watch out for, though: when restoring a backup, either the triggers need to be temporarily dropped, or the history needs to be truncated after the "active" data is restored and before the history is restored. – Uueerdo Jan 30 '18 at 23:25
  • This is a more detailed answer covering my comments https://stackoverflow.com/a/46857738/4104224 – Uueerdo Jan 30 '18 at 23:55
  • See https://dba.stackexchange.com/a/114738/75296 – Ulad Kasach Dec 19 '18 at 12:31
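
A sketch of the point-in-time lookup under the valid-from/valid-to design suggested in the comments (assuming both history tables gain `valid_from` and `valid_to` columns, with `valid_to IS NULL` marking the currently open version; half-open intervals assumed):

```sql
-- Exactly one row per obj_id is "active" at :t1, so no
-- greatest-n-per-group subqueries are needed.
SELECT ah.obj_id, ah.building_no, ch.name AS city_name
FROM address_history ah
JOIN city_history ch
  ON ch.obj_id = ah.city_id
 AND ch.valid_from <= :t1 AND (ch.valid_to IS NULL OR ch.valid_to > :t1)
WHERE ah.obj_id = :address_obj_id
  AND ah.valid_from <= :t1 AND (ah.valid_to IS NULL OR ah.valid_to > :t1);
```

Every additional foreign key (street, zip code, etc.) becomes one more join with the same two validity conditions, so the query grows linearly rather than by one subquery per table.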

2 Answers


(My answer copied from here, since that question never marked an answer as accepted.)

My normal "pattern", in (very) pseudocode:

  • Table A: a_id (PK), a_stuff
  • Table A_history: a_history_id (PK), a_id(FK referencing A.a_id), valid_from, valid_to, a_stuff

Triggers on A:

  • On insert: insert values into A_history with valid_from = now, and valid_to = null.
  • On update: set valid_to = now for the last history record of a_id; then do the same insert as in the "on insert" trigger, with the updated values for the row.
  • On delete: set valid_to = now for last history record of a_id.
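
A minimal MySQL sketch of those three triggers for the example table A (column names are illustrative; error handling omitted):

```sql
DELIMITER //
CREATE TRIGGER a_after_insert AFTER INSERT ON A FOR EACH ROW
BEGIN
  INSERT INTO A_history (a_id, valid_from, valid_to, a_stuff)
  VALUES (NEW.a_id, NOW(), NULL, NEW.a_stuff);
END//

CREATE TRIGGER a_after_update AFTER UPDATE ON A FOR EACH ROW
BEGIN
  -- Close the currently open history row...
  UPDATE A_history SET valid_to = NOW()
  WHERE a_id = NEW.a_id AND valid_to IS NULL;
  -- ...and open a new one mirroring the updated row.
  INSERT INTO A_history (a_id, valid_from, valid_to, a_stuff)
  VALUES (NEW.a_id, NOW(), NULL, NEW.a_stuff);
END//

CREATE TRIGGER a_after_delete AFTER DELETE ON A FOR EACH ROW
BEGIN
  UPDATE A_history SET valid_to = NOW()
  WHERE a_id = OLD.a_id AND valid_to IS NULL;
END//
DELIMITER ;
```

Note that with a foreign key from A_history.a_id to A.a_id, deleting a row from A that still has history would be rejected; either relax that FK (as in the question's design) or mark rows as retired instead of deleting them.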

In this scenario, you'd query history with "x >= from and x < to" (not BETWEEN, since a record's "to" value should match the next record's "from" value).


Additionally, this pattern also makes "change log" reports easier.

  • Without a table dedicated to change logging, the relevant records can be found just by SELECT * FROM A_history WHERE valid_from BETWEEN [reporting interval] OR valid_to BETWEEN [reporting interval].
  • If there is a central change log table, the triggers can just be modified to include log entry inserts as well. (Unless log entries include "meta" data such as reason for change, who changed, etc... obviously).
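
The change-log query from the first bullet, spelled out for the example table (interval bounds as placeholder parameters):

```sql
-- Every version of one record that was opened or closed inside
-- the reporting interval, oldest first.
SELECT h.a_stuff, h.valid_from AS changed_at
FROM A_history h
WHERE h.a_id = :a_id
  AND (h.valid_from BETWEEN :t_start AND :t_end
       OR h.valid_to BETWEEN :t_start AND :t_end)
ORDER BY h.valid_from;
```

Interactive filters (by city name, building number, etc.) are just extra AND conditions on the joined history tables.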

Note: This pattern can be implemented without triggers. Using a stored procedure, or even just multiple queries in code, can actually remove the need for the non-history table.

  • The history table's "a_id" would then need to be replaced with whatever uniquely identifies the record normally; it could still be an id value, but these values would need to be synthesized when inserting, and known when updating/deleting.
  • Queries:
    • (if not new) UPDATE the most recent entry's valid_to.
    • (if not deleting) INSERT new entry
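
Without triggers, the same bookkeeping can be done in a single transaction (sketch; `:a_id` and `:new_stuff` are placeholders):

```sql
START TRANSACTION;
-- (if not new) close the currently open version
UPDATE A_history SET valid_to = NOW()
WHERE a_id = :a_id AND valid_to IS NULL;
-- (if not deleting) open the new version
INSERT INTO A_history (a_id, valid_from, valid_to, a_stuff)
VALUES (:a_id, NOW(), NULL, :new_stuff);
COMMIT;
```
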
Uueerdo
  • I read "Triggers" - I downvoted... (A solid database setup does never ever ever ever ever ever ... EVER... require triggers - IMHO! ) – dognose Jan 31 '18 at 00:20
  • Yeah, that is usually the opinion of people who have had bad experiences with triggers being completely mismanaged and overused. But sure, go ahead and blacklist an entire feature of the SQL standard for all possible use cases. – Uueerdo Jan 31 '18 at 00:24
  • That's an option. My opinion is: people with database knowledge don't require the presence of "triggers" at all... – dognose Jan 31 '18 at 00:26
  • To let me "clarify" and not be the "bad guy": a database is for DATA. A filesystem is for files! And code? That's where the logic goes... Triggers are a crippled way of "moving logic to the database"... :-) – dognose Jan 31 '18 at 00:29
  • I'll agree, triggers are not **required**, even for this answer. But the alternatives are: [1] stored procedure (better make sure everyone uses it, and is "logic in the database"), [2] transaction (better make sure everyone does it the same way, across all languages client applications might be developed in), [3] performed on server that standardizes the transaction (relatively easy in web development, which perhaps is where you are coming from, but completely overkill for applications that do not already have a server). – Uueerdo Jan 31 '18 at 00:36
  • Also, to be clear, triggers can and often are abused; I don't recommend them for many scenarios, but this is one of the few I do. – Uueerdo Jan 31 '18 at 00:40
  • I'll stay with what I've said: "triggers" are just a workaround for the inability to manage a database remotely (i.e. through queries)... Databases are designed to perform "data-storage operations": fast (filtered) data delivery, fast data storage... Everything on top of that is beyond the scope of databases... – dognose Jan 31 '18 at 01:03
  • I don't really like or hate triggers; sometimes they're convenient to use, but only if it is well documented. I don't plan to use triggers to insert history records - I need extra metadata available only in the application. But updating from/to timestamps in triggers is pretty straightforward. Triggers would exist on *history* tables, not main tables. About the FK referencing A in A_history - what if A is removed? I'd like to avoid soft deletes, that's why there are no FKs in the history table. – webfreak Jan 31 '18 at 08:52
  • @dognose I see your point, database is not the place for application logic. But in this situation updating from/to timestamps from triggers is for me the same as using auto_increment id column. I don't mean updating A_history in trigger on A, I mean updating A_history in trigger on A_history - otherwise I would need to add extra query on every insert in A_history and if anyone forgets that in just one place - the whole thing stops working. – webfreak Jan 31 '18 at 09:02
  • @webfreak the triggers would be on A not A_history; triggers cannot modify their own tables. Inserts into A would (via trigger) insert into A_history, updates on A would update the last A_history and insert a new one mirroring the current A record, deletes would just update the last A_history. – Uueerdo Jan 31 '18 at 17:27
  • @webfreak regarding soft deletes, this is meant as a general pattern. I've found very few places where you have deletes on data you actually care enough about to track the history of; in those scenarios it is usually enough to just delete the history along with the master record. In cases where history matters, records tend to have some sort of "retired" flag set instead, so the last history record would reflect that. – Uueerdo Jan 31 '18 at 17:30

This is a very "traditional" problem when it comes down to versioning (or monitoring) changes to a certain row.

There are various "solutions", each having its own drawback and advantage.

The following statements are a result of my experience; they are neither perfect, nor do I claim they are the only ones!

1.) Creating a "history table": That's the worst idea of all. You would always need to take into account which table to query, depending on the data that should be queried. That's a chicken-and-egg problem...

2.) Using ONE table with ONE (increasing) "revision" number: That's a better approach, but it will get hard to query: determining the "most recent row" per "id" is very costly no matter which approach is used.

My personal experience is that following the pattern of a doubly linked list is the best way to solve this when it comes down to millions of records:

3.) Maintain two columns on every entity, say prev_version_id and next_version_id. prev_version_id is NULL if there is no previous version; next_version_id is NULL if there is no later version.

This approach requires you to ALWAYS perform two actions upon an update:

  • Create the new row
  • Update the old row's reference (next_version_id) to the just-inserted row.

However, when your database has grown to something like 100 million rows, you will be very happy that you have chosen this path:

  • Querying the "Oldest" Version is as simple as querying where ISNULL(prev_version_id) and entity_id = 5
  • Querying the "Latest" Version is as simple as querying where ISNULL(next_version_id) and entity_id = 5
  • Getting a full version history will just target the entity_id=5 of the data-table, sortable by either prev_version_id or next_version_id.
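
A sketch of this linked-list pattern with an illustrative schema (MySQL; table and column names are assumptions, not part of the question):

```sql
CREATE TABLE entity_version (
  version_id      BIGINT UNSIGNED NOT NULL PRIMARY KEY,
  entity_id       INT UNSIGNED    NOT NULL,
  prev_version_id BIGINT UNSIGNED NULL,  -- NULL = oldest version
  next_version_id BIGINT UNSIGNED NULL,  -- NULL = latest version
  payload         VARCHAR(255)    NULL
);

-- Latest version of entity 5: the row with no "next" pointer.
SELECT * FROM entity_version
WHERE entity_id = 5 AND next_version_id IS NULL;

-- Latest version of EVERY entity, with no per-group subquery.
SELECT * FROM entity_version WHERE next_version_id IS NULL;
```

An index on (entity_id, next_version_id) keeps both lookups cheap even at large row counts.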

The very often neglected fact: the first two queries will also work to get a list of ALL first versions or ALL recent versions of an entity - in about NO TIME! (Don't underestimate how costly it can be to determine the most recent version of an entity otherwise! Believe me, when testing, everything seems equally fine, but the real struggle starts when live data with millions of records is used.)

cheers, dognose

dognose
  • I'd like to keep current state of data separately from history, perhaps in different DB, different machine etc. as it is used rarely in the application (most of the time we're just working on current data) and the idea of bloating my main tables with history data... well this would complicate things, unique constraints and ALL the queries in application code. The prev/next revision idea is probably the same as @Uueerdo idea with from/to timestamps, only I would need extra table to keep revision numbers, so I would know *when* data changed. – webfreak Jan 31 '18 at 08:42
  • Ad.1 Sorry, but I don't really get your point here. Do you mean deciding whether to query main table or history table? That would never be a problem for me. The app is working on main tables except for couple of places with "History" label ;). Deciding whether to fetch history data or main data is the same as deciding whether to fetch invoice data or profile data - history queries will be specifically designed for fetching history. – webfreak Jan 31 '18 at 09:14