
I would like to re-implement some of my existing SQLAlchemy models in an append-only datastore; append-only meaning that objects are only ever modified with INSERT statements, never with UPDATE or DELETE statements.

An UPDATE or DELETE would be replaced with another INSERT that increments the version. There would also be an is_deleted flag; instead of a DELETE, a new version with is_deleted=True would be created:

id  | version | is_deleted | name      | description ...
---- --------- ------------ ----------- ---------------
  1 |       1 |          F | Fo        | Text text text.
  1 |       2 |          F | Foo       | Text text text.
  2 |       1 |          F | Bar       | null 
  1 |       3 |          T | Foo       | Text text text.         
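
For reference, a minimal sketch of how I picture such a model (the Item name is just an example; the important parts are the composite primary key on (id, version) and the is_deleted flag):

    from sqlalchemy import Boolean, Column, Integer, String, Text
    from sqlalchemy.ext.declarative import declarative_base

    Base = declarative_base()


    class Item(Base):
        """One row per version; (id, version) together identify a physical row."""
        __tablename__ = 'item'

        id = Column(Integer, primary_key=True)                       # logical object id
        version = Column(Integer, primary_key=True, default=1)       # bumped on every change
        is_deleted = Column(Boolean, nullable=False, default=False)  # soft-delete marker
        name = Column(String(50), nullable=False)
        description = Column(Text)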

Additionally, I know how to solve most of these issues, but I am struggling with the event hooks in SQLAlchemy that would handle the things that need to happen on update and delete.

The SQLAlchemy documentation already has some basic examples for versioning. The versioned rows example comes close to what I want, but it does not handle (1) deletion or (2) foreign key relationships.

(1) Deletion. I know there is a session.deleted field, and I would iterate over it in a similar way to how session.dirty is iterated over in the versioned_rows.py example—but how would I unflag the item from the to-be-deleted list & create a new item?
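
To make (1) concrete, this is roughly the kind of hook I have in mind, using the hypothetical Item model from above; I am not sure whether make_transient is the right way to pull the object back out of the deleted set:

    from sqlalchemy import event
    from sqlalchemy.orm import Session, make_transient


    @event.listens_for(Session, 'before_flush')
    def before_flush(session, flush_context, instances):
        for obj in list(session.deleted):
            if not isinstance(obj, Item):
                continue
            make_transient(obj)     # detaches obj and drops its identity, which
                                    # (I believe) also removes it from session.deleted
            obj.version += 1        # next version of the same logical id
            obj.is_deleted = True   # tombstone row instead of a real DELETE
            session.add(obj)        # INSERTed as a new row in this flush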

(2) The above-mentioned example only deals with a parent-child relationship, and the way it does so (expiring the relationship) seems to require custom code for each model. (2.1) Is there a way to make this more flexible? (2.2) Is it possible to configure SQLAlchemy's relationship() to return the object with max(version) for a given foreign key?
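
The closest I have gotten for (2.2) is adapting the "row-limited relationships with window functions" recipe from the SQLAlchemy docs. The sketch below is untested and uses made-up names (an Order model with a plain item_id integer column pointing at Item.id), and it still has to be repeated for every model, which is part of why I am asking:

    from sqlalchemy import and_, func, select
    from sqlalchemy.orm import aliased, foreign, relationship

    # rank every Item row within its logical id, newest version first
    partition = select(
        Item,
        func.row_number()
        .over(partition_by=Item.id, order_by=Item.version.desc())
        .label('rownum'),
    ).alias()

    latest_item = aliased(Item, partition)

    Order.item = relationship(
        latest_item,
        primaryjoin=and_(
            foreign(Order.item_id) == latest_item.id,
            partition.c.rownum == 1,           # newest version only
            latest_item.is_deleted == False,   # and not a tombstone
        ),
        viewonly=True,
        uselist=False,
    )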

lyschoening
  • The term you are looking for is "soft-delete". I don't really know SQLAlchemy, but maybe that'll help you. Personally I'd probably do this with triggers on the database side. – Craig Ringer Oct 02 '14 at 15:39
  • Thanks — I found this on soft delete, which helps with the first part of my question http://stackoverflow.com/questions/23198801/sqlalchemy-using-aliased-in-query-with-custom-primaryjoin-relationship Using database triggers is not an option for me because my model changes frequently and triggers are not handled well by migration tools like alembic. – lyschoening Oct 03 '14 at 09:34
  • Curious question, I'd also love to see some examples of handling this. Regarding linking to the max(version) the better way is probably to not have a version number against the Head version so you can join directly to the latest record (version==None). Does the data need to be in the same table? Just wondering if the versioning objects system (which creates a second table, blah_history) might simplify things http://docs.sqlalchemy.org/en/latest/orm/examples.html#versioning-objects . It also handles deletes in that the whole version history is available in the secondary table. – Aidan Kane Oct 12 '14 at 02:05
  • @AidanKane `Null` as head is not an option because that would necessitate updating old records; the penalty on the `max` query is very small anyway. But I am open to better suggestions. Multiple tables would be fine with me as long as the result is still easy for people to interpret. The history table is something I might consider as an alternative to the kind of solution I am asking about here if no answer is posted. (I don't like that the history table does not enforce foreign keys though). – lyschoening Oct 13 '14 at 08:50
  • OK, so even though all you would have to do is update the previous record with the version_id, you can't - so I'll imagine we're writing to a tape log or something. As I'm sure you know, the penalty of using max() depends on the access pattern. For a single record it's OK, but any sort of bulk selecting becomes a pain because you need a GROUP BY to get efficient max values for a set of records. I'm not sure of an easy way of doing that with relationships (whereas `null` is simple). – Aidan Kane Oct 13 '14 at 10:09
  • Could you have another table for tracking the head? I guess that's sort of the history/version model, where one table is the active state and the other is just historical state. – Aidan Kane Oct 13 '14 at 10:10
  • I guess the history table could enforce foreign keys but what happens if some related historical data changes later? Either you need to also snapshot it at the time or accept that your historical information might not be accurate. Eg would a delete of a related record cascade into your history table? – Aidan Kane Oct 13 '14 at 10:17
  • I just found this http://sqlalchemy-continuum.readthedocs.org/ which I've looked at briefly before (but never used). It's a much more complex versioning extension for sqlalchemy. Maybe it doesn't help at all but might be worth having a dig through the code for inspiration. - It handles versioning relationships http://sqlalchemy-continuum.readthedocs.org/en/latest/version_objects.html#version-relationships – Aidan Kane Oct 13 '14 at 10:23
  • @AidanKane the foreign keys would point to other tables that are either also append-only or have appropriate cascade rules so that they cannot be deleted. As I said before, tracking the head is not a big issue with this design, as the index is not very costly and this can also be cached with little effort. There could still be a tracking table, e.g. to separate the version columns from the data, but I am looking for something that needs no UPDATE or DELETE. – lyschoening Oct 13 '14 at 10:25
  • Take a look at continuum then - it keeps copies of the related tables in a similar fashion. – Aidan Kane Oct 13 '14 at 10:38
  • This sounds a lot like what I've tried to do here: http://stackoverflow.com/questions/16087996/sqlalchemy-audit-logging-how-to-handle-deletes I gave up on getting things to automatically handle relationships, though, and just handle that all manually for the project I'm working on. – Tim Tisdall Nov 17 '14 at 16:16

2 Answers


One helpful thing that would be ORM-tool agnostic might be "instead of" triggers. For instance, you could intercept the update before it happens and instead insert a new row with an incremented version number and the newly updated data.

For PostgreSQL, they are detailed in the official documentation.

Of course, you would have to make some model changes (to primary keys, etc.).

Also, it would be worth studying the performance impact, as fetching the "latest version" would likely require an extra sub-query (through either a view layer, or SQLAlchemy where clauses, etc.).
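
For example, on the SQLAlchemy side a sketch of that "latest version" fetch might look roughly like this (untested; model and column names taken from the question):

    from sqlalchemy import func

    def latest_versions(session, model):
        # newest version number per logical id
        latest = (session.query(model.id, func.max(model.version).label('version'))
                         .group_by(model.id)
                         .subquery())
        # join back to the full rows, skipping soft-deleted heads
        return (session.query(model)
                       .join(latest, (model.id == latest.c.id) &
                                     (model.version == latest.c.version))
                       .filter(model.is_deleted == False))

    # usage: latest_versions(session, Item).all()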

joefromct

As crazy as it may sound to you, it might in fact be better to use a different kind of database. Are you aware of Datomic? One of the fundamental differences between a traditional RDBMS and this type of system is that there is no update-in-place, which is how an RDBMS updates files on disk. Instead, everything is versioned, and you can go back through all previous versions of the database for every change to every single resource. Additionally, you can easily see the state of the whole database at a particular moment in time, simply by passing the time of interest as a parameter. There are a lot of other interesting advantages, and I highly recommend taking a look at some of Rich Hickey's talks on it, for example this one. It is definitely a fundamentally different approach to what you're currently attempting, but you have to consider whether that endeavour is going to be dogged by fighting against the tools at every step of the way, using them in a fashion they really weren't designed for (RDBMS, ORM, migration manager, ...). Instead, you could push that complexity down a layer and let a different kind of DB handle it for you.

Dunk