0

In about every SQL-based database application I have worked on so far, sooner or later the following three-faceted requirement has popped up:

  • There is some entity, linked in a hierarchical fashion (i.e. the tuples form a tree structure).
  • Users must be able to define any number of custom attributes with values for the tuples, and these values are inherited/overridden towards the leaves of the tree structure. ("Dumb" attributes usually suffice. That is, no uniqueness constraints, no foreign keys, only one value per attribute, ...)
  • Users must be able to run arbitrary queries on this data (i.e. custom boolean expressions, based upon filters for the values of the user-defined attributes that are linked with AND/OR).

Storing the data, roughly matching the first two bullets above, is quite straightforward:

  • The hierarchy is built up by giving the respective table a parent column. This column will be null for root nodes, and a pointer to the ID of the parent node for all other nodes.
  • The user-defined attributes are stored according to the entity-attribute-value pattern.

While there are numerous resources that suggest to use a different approach especially in the latter point (e.g. answers here, here, or here), I have not usually been in a position to move away from a traditional static relational database schema. Hence, let's simply assume the above as a given. Also, hardly ever could I rely on the specifics of a particular DBMS; the more usual case was systems that were supposed to work with MS SQL Server, Oracle, and possibly others as backends without requiring two significantly different product versions.

Solving the third item, however, is always problematic (even without considering the hierarchical inheritance of attribute values). The number of joins depends on the different number of attributes considered in the boolean expression. Alternatively, the number of joins can somewhat be reduced by determining the maximum number of distinct attributes considered in any case of the custom boolean expression, which may save joins, but makes the resulting queries and the code used to generate them even less intelligible and maintainable. For instance,

a = 5 or (b = 8 and c = 9)

could do with 2 joins to the attribute-value table.

I have always been able to do this "somehow", but as this appears to be a fairly ubiquitous situation, I am looking for the "canonical" way to generate SQL queries in this situation. Is there a "standard pattern" to follow here?

Community
  • 1
  • 1
O. R. Mapper
  • 20,083
  • 9
  • 69
  • 114

1 Answers1

1

Careful not to fall prey to the inner platform effect. It is a complicated problem, and SQL itself is designed to handle the complexities. Generate DDL to add and remove columns as needed, and generate simple select statements for queries. Store each Tuple Type (distinct set of attributes) as a table.

With regards to inheritance, I recommend handling it in the application or DAL, and only storing the non-inherited values. On retrieval, read all parent rows to calculate the functional values. If you do need to access "functional" values from SQL, use an indexed view or triggers to maintain them separate from storage.

Hierarchies can be represented as you describe, but a simple "Parent" column can make it difficult to query beyond a single level. Look at hierarchyid on SQL Server or CONNECT BY on oracle.

Avoiding EAV stores allows you to:

  • Use indexes and statistics where needed
  • Keep efficient storage (ints stored as ints, money stored as money)
  • Keep understandable queries (SELECT * FROM vwProducts WHERE Color = 'RED' ORDER BY Price ASC)

If you want an EAV system because you have too many attributes (>1024 per type) or they are not somewhat statically defined (many changes per hour), I would avoid using a relational database in the first place. Use an EAV (NoSQL) database server instead.

tl;dr: If you have a schema, use DDL to tell the server about it. If you don't, use a more appropriate server.

Mitch
  • 21,223
  • 6
  • 63
  • 86
  • "Store each Tuple Type (distinct set of attributes) as a table." - I am not sure how to combine this suggestion and the restriction "I have not usually been in a position to move away from a traditional static relational database schema. Hence, let's simply assume the above as a given." stated in the question. The suggestions concerning the hierarchy sound helpful, though; thank you for these. – O. R. Mapper Mar 26 '16 at 19:28
  • @O.R.Mapper, the canonical approach is to use a database server as it is designed. Storing anything as EAV in a table based database server is unquestionably a hack. Allowing the user to add arbitrary attributes is the very definition of a dynamic schema. Why do you want a static schema? – Mitch Mar 26 '16 at 19:34
  • *I* do not want a static schema. It's the unfortunate reality of business settings that the people in the project who are in charge of the DB administration will tell me that I'm getting exactly the tables I ordered in advance, and they are backed up by the higher-ups who strictly reject the idea that any user running the application could possibly be allowed to have rights to alter anything about the schema. – O. R. Mapper Mar 26 '16 at 19:39
  • 1
    @O.R.Mapper, you can always have a dedicated schema for user defined queries. This is really a question of educating the client to allow the right tool for the job. What you are asking to do is not wise nor "canonical". What are their reasons for forcing a static schema? It is often helpful to show the DBA the queries and schemas that they will have to deal with in both scenarios, I've not met one that would prefer EAV. – Mitch Mar 26 '16 at 19:48
  • "What are their reasons for forcing a static schema?" - in a recent example ... the people who decide about the project (a) are absolute fans of SQL, (b) are happy to have just recovered from a failed experiment at using an object DB and consider "object DBs" a taboo topic since, (c) are convinced that `WHERE EXISTS (SELECT '0' FROM KeyValueTable x WHERE x.y = 42))` constructs are just as fast and avoid the joins, (d) are entirely opposed to granting any user schema editing rights even for single tables, and (e) are convinced dynamically adding columns is impractical due to frequent ... – O. R. Mapper Apr 04 '16 at 17:26
  • ... addition/removal of properties by users, anyway. Also, they (f) consider generating a materialized table prohibitively expensive in terms of space and - if done lazily/temporarily - also in time. In short, a fixed schema with a key-value table whose contents can be inherited along a hierarchy is what the powers that be have decided upon. These aren't the only reasons I have encountered, and rarely all of them, and I can also code away without any such obstacles in my own projects, of course - but having that amount of leeway is, sadly, not always a given. – O. R. Mapper Apr 04 '16 at 17:27