
I am having some problems designing a schema for dynamic properties with versioning. Assume the following use case:

I have a table called Actor which contains the id and a name (to keep it simple). The upper bound for my case would be that this table contains about 1 million entries.

Further, every actor gets properties assigned to it. Because I don't know the properties in advance, I need a table to manage them. I thought about a Property table. The resulting n:m relation will be resolved with a join table between Actor and Property containing their primary keys and the property value (type?).

At the moment this seems pretty easy to handle. With one million entries each having 10 properties, the ActorProperty table would have ten million rows. I believe with a B-tree index (log2(n) lookups) this should be no problem.
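
As a rough sanity check of that figure:

log2(10,000,000) ≈ 23

so a single point lookup needs on the order of 23 comparisons, and a real B-tree with a high fan-out touches only a handful of index pages.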

Now comes the part I am struggling with. The properties should somehow be tracked. Over time these properties change, but the history should not get lost. Most likely it will be done with a timestamp. Note that multiple properties get updated at the same time. An example would be: I take a snapshot of all actors every day, and if something changes, I'll update all changed properties at the same time. This results in 365 timestamps per year.

If I used another table to manage the versions (timestamps) and added another foreign key to the ActorProperty table, I would get 365 * 10 million entries. This should be the maximum I would ever reach. Most of the time the datasets will be significantly smaller.
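
Roughly, the tables I have in mind so far (just a sketch, names are provisional):

CREATE TABLE Actor (
    id   INT PRIMARY KEY,
    name VARCHAR(100) NOT NULL
);

CREATE TABLE Property (
    id   INT PRIMARY KEY,
    name VARCHAR(100) NOT NULL
);

CREATE TABLE Version (
    id         INT PRIMARY KEY,
    created_at TIMESTAMP NOT NULL        -- the snapshot timestamp
);

CREATE TABLE ActorProperty (
    actor_id    INT NOT NULL,            -- FK to Actor
    property_id INT NOT NULL,            -- FK to Property
    version_id  INT NOT NULL,            -- FK to Version
    value       VARCHAR(500),
    PRIMARY KEY (actor_id, property_id, version_id)
);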

My question now is more about performance. I read the following answer about indexes: How does database indexing work. Is it not awfully slow to query a table with that many entries? An example query would be: the first 100 actors with all their properties at a given timestamp id=x. Also, I feel like the schema I'm thinking of is maybe not the best. Does anyone have suggestions or ideas for a schema with more scalability?

By the way, I am also evaluating NoSQL approaches separately, so here I'd like to concentrate on relational approaches for the time being. My aim is to gather the advantages and disadvantages of the different technologies and end up with a theoretical schema or model for the described use case. And the performance of an optimal model on a relational database is something I seem to have trouble evaluating or finding information on.

Thanks!

Slomo

4 Answers


The properties should somehow be tracked

Exactly how they should be tracked is the important thing here. Taking the simplest case, you would want to query the state at any given time - so the solution would be to have multiple, time-dependent records in the decomposing table:

create table actor_property (
  actor_id    INT NOT NULL,
  property_id INT NOT NULL,
  starttime   DATE NOT NULL,
  endtime     DATE NOT NULL DEFAULT '9999-12-31',  -- open-ended records use a far-future end date
  PRIMARY KEY (actor_id, property_id, starttime, endtime)
);

A consequence of this is that you need to handle the case when you attempt to link an actor to a property and the link already exists in the table (you can't update the same table in a trigger, but you can check for a conflict and raise an exception). You can then query the state of the data at any time with:

SELECT a.name, p.name
FROM actor a
INNER JOIN actor_property ap
   ON a.id = ap.actor_id
INNER JOIN property p
   ON p.id = ap.property_id
WHERE $snapshot_date >= ap.starttime
  AND $snapshot_date <= ap.endtime

Using a materialized view of the current records in actor_property in place of the full table in the query above will be slightly faster - depending on how frequently the relationships change.
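
MySQL has no native materialized views, so one way to approximate this is a summary table holding only the open records, rebuilt whenever the relationships change (a sketch; the table name is illustrative):

CREATE TABLE actor_property_current AS
SELECT actor_id, property_id, starttime
FROM actor_property
WHERE endtime = '9999-12-31';

-- rebuild after a batch of changes (or from a scheduled event):
TRUNCATE actor_property_current;
INSERT INTO actor_property_current
SELECT actor_id, property_id, starttime
FROM actor_property
WHERE endtime = '9999-12-31';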

Is it not awfully slow to query a table with that many entries?

Not really. Unless you need to analyse the entire dataset frequently, most operations only look at a small subset of the rows, and databases typically evolve areas of hot data - read caching is far more effective than MySQL's query caching (which is very specific).

symcbean
  • Thanks for mentioning the materialized view, I forgot about that. It would be a possibility, as the data will not change frequently. I will test the query performance on a schema based on your answer. Do you maybe have some key elements to think of which may have an impact on the query performance? I heard MyISAM is faster, because it does not have transactions or referential integrity. – Slomo Jun 21 '12 at 13:26
  • The main difference between InnoDB and MyISAM is support for concurrency - which results in a big performance benefit for MyISAM if the concurrency of writes is very low. On an existing MySQL system, if the ratio Table_locks_waited/Table_locks_immediate is > about 0.2, then InnoDB will be faster. – symcbean Jun 22 '12 at 08:16
  • Thank you! Very useful to know. Together with all the other answers I now have enough information. Unfortunately I can only accept one answer, even though every one of them helped me a lot... – Slomo Jun 22 '12 at 08:37

I've used a somewhat similar design in one of the applications.

First, I think that the set of distinct properties will not be that big (in theory), so it is worth factoring it out. For this purpose I would create a PROPERTY_TYPE table with unique ID and NAME columns. This way, the main property table would have ACTOR_ID, PROPERTY_TYPE_ID and VALUE columns, which gives you 2 benefits:

  1. the size of the table is drastically reduced, because each property name is stored only once no matter how often it is used;
  2. performance of the queries will be significantly better.

Now to the property tracking. I like the approach where one tracks an object's instances in time, each instance having its start and end time. The currently active instance of the property can be found using now() BETWEEN start_dt AND coalesce(end_dt, now()), as an open instance's end_dt is NULL.

The schema will look like:

CREATE TABLE actor (
    actor_id   integer not null,
    actor_name varchar(100) not null,
    PRIMARY KEY (actor_id)
    );
CREATE TABLE property_type (
    property_type_id   integer not null,
    property_type_name varchar(100) not null,
    PRIMARY KEY (property_type_id),
    UNIQUE (property_type_name)
    );
CREATE TABLE actor_property (
    actor_id         integer not null,
    property_type_id integer not null,
    property_value   varchar(500) not null,
    start_dt         timestamp not null,
    end_dt           timestamp,
    PRIMARY KEY (actor_id, property_type_id, start_dt)
    );
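
For example, all properties of one actor as they were at a given moment could be read with something like this (a sketch against the schema above; :actor_id and :snapshot_ts are placeholders):

SELECT a.actor_name, pt.property_type_name, ap.property_value
FROM actor a
JOIN actor_property ap ON ap.actor_id = a.actor_id
JOIN property_type  pt ON pt.property_type_id = ap.property_type_id
WHERE a.actor_id = :actor_id
  AND :snapshot_ts BETWEEN ap.start_dt AND coalesce(ap.end_dt, :snapshot_ts);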

Notes on the implementation:

  1. updating a property is effectively an atomic close instance + create instance operation (see the sketch below). Therefore it is good to either wrap it into a START TRANSACTION; ... COMMIT; block or (which I like better) create a function that will do the job;
  2. using DB-side functions is a good style in any case;
  3. the primary keys on all the tables have implicit indexes behind them which, in turn, will give you the expected performance;
  4. a potential 365e6 rows in the actor_property table is not a big deal on modern hardware. Given your indexes are in place and well balanced, you'll perform at most around 30 disk page reads to fetch a single entry from this table in the worst case.
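
A sketch of the close-instance + create-instance update from note 1 (:actor_id, :property_type_id and :new_value are placeholders; the same statements could live inside a DB-side function):

START TRANSACTION;

-- close the currently open instance of this property
UPDATE actor_property
   SET end_dt = now()
 WHERE actor_id = :actor_id
   AND property_type_id = :property_type_id
   AND end_dt IS NULL;

-- open the new instance
INSERT INTO actor_property (actor_id, property_type_id, property_value, start_dt)
VALUES (:actor_id, :property_type_id, :new_value, now());

COMMIT;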
vyegorov

Both @symcbean and @vyegorov are right in their approach - on modern hardware, the simple queries should be no problem with the volumes of data you're talking about.

However, the schema design (commonly known as "entity/attribute/value" or EAV) has some drawbacks when it comes to querying that you may want to consider.

Common relational statements can become very convoluted - and often slow. For instance, imagine a query to find the actors who have property "height" > 1.9, property "age" <= 25, property "agent" not like 'sleazeball', and where there is no current occurrence of property "hard to work with".
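
A sketch of what that looks like against an actor_property / property_type design like the one above (names are illustrative; the casts and the extra join per property are exactly what makes it painful):

SELECT a.actor_id, a.actor_name
FROM actor a
JOIN actor_property hp
  ON hp.actor_id = a.actor_id
 AND hp.property_type_id = (SELECT property_type_id FROM property_type WHERE property_type_name = 'height')
 AND hp.end_dt IS NULL
 AND CAST(hp.property_value AS DECIMAL(5,2)) > 1.9
JOIN actor_property ap_age
  ON ap_age.actor_id = a.actor_id
 AND ap_age.property_type_id = (SELECT property_type_id FROM property_type WHERE property_type_name = 'age')
 AND ap_age.end_dt IS NULL
 AND CAST(ap_age.property_value AS UNSIGNED) <= 25
-- ... plus yet another join for "agent" not like 'sleazeball', and so on
WHERE NOT EXISTS (
    SELECT 1
    FROM actor_property x
    JOIN property_type xt ON xt.property_type_id = x.property_type_id
    WHERE x.actor_id = a.actor_id
      AND xt.property_type_name = 'hard to work with'
      AND x.end_dt IS NULL
);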

If the "property_value" colum is varchar, numerical comparisons tend to behave counterintuitively.

Searching for "in", "not in" etc. is awkward.

Interpreting "agent not like 'sleazeball'" might mean two things - either there IS a property called agent and its value is NOT sleazeball, or there isn't even a property called agent.

The reason I mention all these issues is to encourage you to go a little further in your design - it's not enough to think about performance as a hypothetical thing; you need to think through realistic query scenarios.

Neville Kuyt

Depending on your circumstances, you may find you get better performance if you split the problem into "current properties" and "past properties". Various ORMs take this approach for their versionable behaviours, since it keeps the frequently queried "current" table small instead of letting it grow with every new version.

In your case, therefore, consider your Actor table being paired with:

  • ActorProperty (fk = actor_id)
  • ActorPropertyVersionable (fk = actor_id, version_num)

So, when writing a new property for an actor, you should first copy the existing value and insert that into the versionable table, and then add the new value into the current table. Wrap this in a transaction to keep it safe.
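
For example (a sketch; actor_property / actor_property_versionable and the placeholders are illustrative names for the two tables above):

START TRANSACTION;

-- 1. archive the value that is about to be replaced
INSERT INTO actor_property_versionable (actor_id, property_id, property_value, version_num)
SELECT actor_id, property_id, property_value, :version_num
FROM actor_property
WHERE actor_id = :actor_id
  AND property_id = :property_id;

-- 2. overwrite the current value
UPDATE actor_property
   SET property_value = :new_value
 WHERE actor_id = :actor_id
   AND property_id = :property_id;

COMMIT;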

In general, property queries are most often interested in the current property value, and access to past values is required much less frequently (though you'll need to make a judgement about your own use case, of course). It does require two different queries (current value, past value) when interrogating your data, but the performance benefit may well be worth it.

halfer