
I've set up my database and application to soft delete rows. Every table has an is_active column whose value should be either TRUE or NULL. The problem I have right now is that my data falls out of sync: unlike a DELETE statement, setting a value to NULL doesn't cascade to rows in other tables that reference the "deleted" row via a foreign key.

I have already taken measures to correct the data by finding inactive rows in the source table and manually setting the related rows in other tables to inactive as well. I recognize that I could do this at the application level (I'm using Django/Python for this project), but I feel like this should be a database process. Is there a way to use something like the ON UPDATE action of PostgreSQL's foreign-key constraints so that when a row has is_active set to NULL, all rows in other tables referencing that row via a foreign key automatically have is_active set to NULL as well?

Here's an example:

An assessment has many submissions. If the assessment is marked inactive, all submissions related to it should also be marked inactive.
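A minimal sketch of the two tables involved (table and column names are hypothetical, inferred from the description above):

```sql
-- Hypothetical schema illustrating the problem: marking an
-- assessment inactive does not touch its submissions.
create table assessment (
  assessment_id integer primary key,
  is_active     boolean  -- TRUE or NULL in this scheme
);

create table submission (
  submission_id integer primary key,
  assessment_id integer not null references assessment (assessment_id),
  is_active     boolean
);
```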

ngoue

3 Answers


To my mind, it doesn't make sense to use NULL to represent a Boolean value. The semantics of "is_active" suggest that the only sensible values are TRUE and FALSE. Also, NULL interferes with cascading updates.

So I'm not using NULL.

First, create the "parent" table with both a primary key and a unique constraint on the primary key and "is_active".

create table parent (
  p_id integer primary key,
  other_columns char(1) default 'x',
  is_active boolean not null default true,
  unique (p_id, is_active)
);

insert into parent (p_id) values
(1), (2), (3);

Create the child table with an "is_active" column. Declare a foreign key constraint referencing the columns in the parent table's unique constraint (last line in the CREATE TABLE statement above), and cascade updates.

create table child (
  p_id integer not null,
  is_active boolean not null default true,
  foreign key (p_id, is_active) references parent (p_id, is_active) 
    on update cascade,
  some_other_key_col char(1) not null default '!',
  primary key (p_id, some_other_key_col)
);

insert into child (p_id, some_other_key_col) values
(1, 'a'), (1, 'b'), (2, 'a'), (2, 'c'), (2, 'd'), (3, '!');

Now you can set the "parent" to false, and that will cascade to all referencing tables.

update parent 
set is_active = false 
where p_id = 1;

select *
from child
order by p_id;
p_id  is_active  some_other_key_col
--
1     f          a
1     f          b
2     t          a
2     t          c
2     t          d
3     t          !

Soft deletes are a lot simpler and have much better semantics if you implement them as valid-time state tables. FWIW, I think the terms soft delete, undelete, and undo are all misleading in this context, and I think you should avoid them.

PostgreSQL's range data types are particularly useful for this kind of work. I'm using date ranges, but timestamp ranges work the same way.

For this example, I'm treating only "parent" as a valid-time state table. That means that invalidating a particular row (soft deleting a particular row) also invalidates all the rows that reference it through foreign keys. It doesn't matter whether they reference it directly or indirectly.

I'm not implementing soft deletes on "child". I can do that, but I think that would make the essential technique unreasonably hard to understand.

create extension btree_gist; -- Necessary for the kind of exclusion
                             -- constraint below.

create table parent (
  p_id integer not null,
  other_columns char(1) not null default 'x',
  valid_from_to daterange not null,
  primary key (p_id, valid_from_to),
  -- No overlapping date ranges for a given value of p_id.
  exclude using gist (p_id with =, valid_from_to with &&)
);

create table child (
  p_id integer not null,
  valid_from_to daterange not null,
  foreign key (p_id, valid_from_to) references parent on update cascade,

  other_key_columns char(1) not null default 'x',
  primary key (p_id, valid_from_to, other_key_columns),

  other_columns char(1) not null default 'x'
);

Insert some sample data. In PostgreSQL, the daterange data type has a special value 'infinity'. In this context, it means that the row that has the value 1 for "parent"."p_id" is valid from '2015-01-01' until forever.

insert into parent values 
(1, 'x', daterange('2015-01-01', 'infinity'));

insert into child values
(1, daterange('2015-01-01', 'infinity'), 'a', 'x'),
(1, daterange('2015-01-01', 'infinity'), 'b', 'y');

This query will show you the joined rows.

select *
from parent p 
left join child c 
       on p.p_id = c.p_id 
      and p.valid_from_to = c.valid_from_to;

To invalidate a row, update the date range. This row (below) was valid from '2015-01-01' to '2015-01-31'. That is, it was soft deleted on 2015-01-31.

update parent
set valid_from_to = daterange('2015-01-01', '2015-01-31')
where p_id = 1 and valid_from_to = daterange('2015-01-01', 'infinity');

Insert a new valid row for p_id 1, and pick up the child rows that were invalidated on Jan 31.

insert into parent values (1, 'r', daterange(current_date, 'infinity'));

update child set valid_from_to = daterange(current_date, 'infinity')
where p_id = 1 and valid_from_to = daterange('2015-01-01', '2015-01-31');

Richard T Snodgrass's seminal book Developing Time-Oriented Database Applications in SQL is available free from his university web page.

Mike Sherrill 'Cat Recall'
  • The reason we use `TRUE` and `NULL` instead of `FALSE` is because we have a unique constraint comprised of multiple values including `is_active`. It's not uncommon in our system to delete a record and later recreate another one that has the same unique values. This is fine until the second object gets marked as inactive. Once there are two objects that have the same unique values PostgreSQL complains – and rightfully so. By using `NULL` PostgreSQL doesn't recognize the two rows as conflicting because `NULL` doesn't evaluate as unique. – ngoue Apr 12 '15 at 05:07
  • FK on two columns is more reliable, for a very small price on storage. I would probably go this route. – Erwin Brandstetter Apr 12 '15 at 12:01
  • *"It's not uncommon in our system to delete a record and later recreate another one that has the same unique values."* How do you distinguish one tuple whose "unique" values are, for example {AK, Alaska} from 42 other tuples whose "unique" values are {AK, Alaska}? If you can't distinguish them, there's no point in allowing 43 of them. If you *can* distinguish them, then they're not "unique". Paste CREATE TABLE statements into your question, and add clarifying text, please. – Mike Sherrill 'Cat Recall' Apr 12 '15 at 12:04
  • The point of soft deleting records is for emergencies where we need to recover information that a user deleted. In our system we have courses and course members. It's possible for a course member to be removed from a course and then added again later. This "deletes" the first record and creates a subsequent new record with the same "unique" values. At any given time there should only be a single ***active*** unique object. Once it's deleted, we can have as many inactive objects that share the same unique values. By using `NULL`, deleted rows don't conflict with other deleted rows. – ngoue Apr 16 '15 at 17:20
  • @mcjoejoe0911: I know the point of soft deleting rows. I just think Boolean columns are a really bad choice for implementation. I added an implementation that I think is more robust, using range data types, exclusion constraints, and cascading updates. – Mike Sherrill 'Cat Recall' Apr 17 '15 at 03:53
  • @ErwinBrandstetter: I added an implementation of "soft deletes" as a valid-time state table, using PostgreSQL range types, exclusion constraints, and cascading updates. I think the semantics are much better than using Boolean columns. I'd appreciate your opinion. – Mike Sherrill 'Cat Recall' Apr 17 '15 at 12:20
  • In your design, `p_id` would not be unique. IMO it's better to have `p_id serial PRIMARY KEY` and `UNIQUE (valid_from_to, p_id)` with some redundancy. Makes queries simpler and faster. Details: http://stackoverflow.com/a/29688955/939860. You should probably mention that including an `integer` column in the exclusion constraint requires the additional module **btree_gist**. Details: http://stackoverflow.com/a/20908766/939860 and: http://stackoverflow.com/a/22111524/939860. Also relevant: stuff about **enforcing `[)` bounds** in timestamp ranges. – Erwin Brandstetter Apr 18 '15 at 22:55
  • And you wouldn't strictly need `infinity` in the time range: http://stackoverflow.com/a/27105923/939860 and: http://stackoverflow.com/a/15579208/939860. – Erwin Brandstetter Apr 18 '15 at 22:56
  • @ErwinBrandstetter: Thanks for your input. I don't think a unique "p_id" will work in the context of "soft deletes". An "undelete" has to give us a row with the "p_id" that was "deleted". – Mike Sherrill 'Cat Recall' Apr 18 '15 at 23:49
  • @MikeSherrill'CatRecall': Ah, I see: the "same" `p_id` at different points in time. I would then add another `serial` column (say: `parent_id`) as surrogate primary key, this is much easier to handle than `(p_id, valid_from)` for various purposes. `tsrange` has typically 25 bytes and comparatively slow for various purposes ... – Erwin Brandstetter Apr 19 '15 at 00:13
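The NULL-based uniqueness behaviour ngoue describes in the comments can be sketched as follows (table and column names are hypothetical). In PostgreSQL, two NULLs are never considered equal by a unique constraint, so any number of "deleted" rows can coexist alongside at most one active row:

```sql
create table course_member (
  course_id integer not null,
  member_id integer not null,
  is_active boolean,           -- TRUE or NULL, per the scheme in the question
  unique (course_id, member_id, is_active)
);

-- Only one active membership at a time:
insert into course_member values (1, 1, true);    -- ok
-- insert into course_member values (1, 1, true); -- would violate the constraint

-- But any number of soft-deleted rows:
insert into course_member values (1, 1, null);    -- ok
insert into course_member values (1, 1, null);    -- ok again; NULLs don't conflict
```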

You can use a trigger:

CREATE OR REPLACE FUNCTION trg_upaft_upd_trip()
  RETURNS TRIGGER AS
$func$
BEGIN

UPDATE submission s
SET    is_active = NULL
WHERE  s.assessment_id = NEW.assessment_id
AND    NEW.is_active IS NULL;  -- recheck to be sure

RETURN NEW;                    -- call this BEFORE UPDATE

END
$func$  LANGUAGE plpgsql;

CREATE TRIGGER upaft_upd_trip
BEFORE UPDATE ON assessment
FOR EACH ROW
WHEN (OLD.is_active AND NEW.is_active IS NULL)
EXECUTE PROCEDURE trg_upaft_upd_trip();
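A quick sanity check of the trigger, assuming assessment and submission tables shaped like the ones in the question:

```sql
-- Assumed minimal schema (names taken from the question and the trigger body):
--   assessment (assessment_id int primary key, is_active boolean)
--   submission (submission_id int primary key,
--               assessment_id int references assessment, is_active boolean)

update assessment
set    is_active = NULL
where  assessment_id = 1;
-- The WHEN clause fires the trigger only for rows going from active
-- to NULL; afterwards every submission with assessment_id = 1 should
-- also have is_active IS NULL.

select count(*)
from   submission
where  assessment_id = 1
and    is_active is not null;  -- expect 0
```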

Be aware that a trigger has more possible points of failure than FK constraints with ON UPDATE CASCADE ON DELETE CASCADE.

@Mike added a solution with a multi-column FK constraint that I would consider as an alternative.

Erwin Brandstetter
  • Could you elaborate on what ways the trigger could fail? – ngoue Apr 12 '15 at 01:23
  • @mcjoejoe0911: Errors in the plpgsql code, errors in `CREATE TRIGGER`, other triggers interfering, you have to cover all relevant operations explicitly (`INSERT`, `UPDATE`, `DELETE`), special case: `TRUNCATE`. Unlike FK constraints , triggers are not enforced on already existing rows, which may be an issue for backup / restore ... – Erwin Brandstetter Apr 12 '15 at 11:23
  • I think this is a very helpful solution, but not the exact solution I'm looking for. – ngoue Apr 16 '15 at 17:23
  • 1
    @mcjoejoe0911: I agree with Mike's answer. If you can go that route, it's a good choice. I added a link to a related answer with more explanation and links. – Erwin Brandstetter Apr 16 '15 at 20:15

This is more a schema-design problem than a procedural one.

You may have dodged creating a solid definition of what constitutes a record. At the moment you have object A that may be referenced by object B, and when A is "deleted" (has its is_active column set to FALSE, or NULL in your current case), B does not reflect that. It sounds like this is a single table (you only mention rows, not separate classes or tables...) and you have a hierarchical model formed by self-reference. If that is the case, you can think of the problem in a few ways:

Recursive lineage

In this model you have one table that contains all the data in one place, whether it's a parent, a child, etc., and you check the table for recursive references to traverse the tree.

It is tricky to do this properly in an ORM that lacks explicit support for it without accidentally writing routines that either:

  • iteratively pound the crap out of your DB by making at least one query per node, or
  • pull the entire table at once and traverse it in application code

It is, however, straightforward to do this in Postgres and let Django access it via a model over an unmanaged view on the lineage query you build. (I wrote a little about this once.) Under this model your query will descend the tree until it hits the first row of the current branch that is marked as not active and stop, thus effectively truncating all the rows below associated with that one (no need for propagating the is_active column!).
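The lineage query described above can be sketched with a recursive CTE (the node table and its column names are hypothetical). The recursive join condition prunes inactive rows, so everything below the first inactive node is truncated without propagating any is_active column:

```sql
-- Hypothetical self-referencing table; a row that is its own parent
-- is a primary entity (a root), per the blog + comments example below.
create table node (
  id        integer primary key,
  parent_id integer not null references node (id),
  is_active boolean not null default true,
  body      text
);

with recursive lineage as (
  select *
  from   node
  where  id = parent_id          -- roots reference themselves
  and    is_active
  union all
  select n.*
  from   node n
  join   lineage l on n.parent_id = l.id
  where  n.id <> n.parent_id     -- don't revisit the roots
  and    n.is_active             -- stop descending at the first inactive row
)
select * from lineage;
```

A Django unmanaged model can then sit over a view defined with this query, so the application only ever sees the active portion of each tree.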

If this were, say, a blog entry + comments within the same structure (a fairly common CMS schema) then any row that is its own parent is a primary entity and anything that has a parent that is not itself is a comment. To remove a whole blog post + its children you mark just the blog post's row as inactive; to remove a thread within the comments mark as inactive the comment that begins that thread.

For a blog + comments type feature this is usually the most straightforward way to do things -- though most CMS systems get it wrong (but usually only in ways that matter if you start doing serious data stuff later, if you're just setting up some place for people to argue on the internet then Worse is Better).

Recursive lineage + External "record" definition

In this model you have your tree of nodes and your primary entities separated. The primary entities are marked as being active or not, and that attribute is common to all the elements that are related to it within the context of that primary entity (they exist and have a meaning independent of it). This means two tables, one for primary entities, and one for your tree of nodes.

Use this when you have something more interesting going on than simply threaded discussion. For example, a model of components where a tree of things may be aggregated separately into other larger things, and you need to have a way to mark those "other larger things" as active or not independently of the components themselves.
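One way to sketch the split described above (all names hypothetical): the primary entity carries the active flag, and the tree nodes hang off it, so marking the entity inactive hides its whole tree without touching any node rows:

```sql
create table primary_entity (
  id        integer primary key,
  is_active boolean not null default true
);

create table node (
  id        integer primary key,
  entity_id integer not null references primary_entity (id),
  parent_id integer references node (id),  -- NULL for tree roots
  body      text
);

-- "Is this node visible?" is answered by the entity, not the node:
select n.*
from   node n
join   primary_entity e on e.id = n.entity_id
where  e.is_active;
```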

Further down the rabbit hole...

There are other takes on this idea, but they get increasingly non-trivial, which is probably not suitable. For example, consider a third basic take on this model where the hierarchy structure, the node bodies, and the primary entities are all separated into different tables. One node body might appear in multiple trees by reference, and multiple trees may be considered active or inactive in the context of a single primary entity, etc.

Consider heading this direction if your data is more complex. If you wind up really needing models this far decomposed ("normalized") then I would caution that any ORM is probably going to wind up being a lot more trouble than it's worth -- you will start running headlong into the problem that ORMs are fundamentally leaky abstractions (1 object can never really equate to 1 table...).

zxq9
  • I didn't realize that I did not explicitly state that the data I'm trying to manipulate exists in separate tables. There are no recursive relationships currently in my database. I've edited my question to say the data exists in separate tables. – ngoue Apr 12 '15 at 01:23