11

I need to store a set of entities, of which there are several specialized versions. They have some common properties, but the specialized ones contain properties specific for that entity.

Solutions

The data store is a relational DBMS, and this is not for discussion :-) Specifically, it is the Microsoft SQL Server 2005.

I could easily create a table for the common properties and then a table for each of the specialized versions. However, it is likely that new entities will have to be added to the solution later and I don't want to maintain both an object model and a database schema.

Another idea is to create a table

reading(<common properties>, extended_properties)

and have the extended_propertiesfield be some kind of serialization of the extended properties. I was thinking either JSON or XML. I will most likely be using an ORM framework, but I haven't decided yet. Either way, the object representation of a specialized entity from the reading could expose a dictionary {extended_property_name, value} containing the parsed key/value pairs from the extended_properties field.

From this http://msdn.microsoft.com/en-us/library/ms345117(SQL.90).aspx I gather that XML fields, combined with schemas for these, give the notion of typed XML inside the DBMS. Also, queries involving the XML contents in the extended_propertiesfield can take these into account, too.

What I want

Feedback on my solution suggestions, primarily the one with the reading table and serialization of the extended properties.

Also, I realize this is one of the limitations of relational DBMS' compared to key/value based stores. However, there surely must be some modelling techniques to accommodate this.

Any feedback is greatly appreciated!

Daniel Fischer
  • 181,706
  • 17
  • 308
  • 431
Anders
  • 856
  • 1
  • 7
  • 14

5 Answers5

18

Anders, do not give up any integrity or hardness, eg type safety.

(Response coming).

@Anders. No, not at all, subtyping is fine (the question is which form you use and what are the dis/advantages). Do not give up any strength or Integrity or type safety or checks or DRI. The form you choose will demand additional Checks and maybe a bit of code (depends on your platform).

This subject is coming up frequently, but the seeker always has a narrow perspective; I keep making the same statements (a subset) from an unchanging set. The idea is to evaluate all the options. So I am writing a doc. Unfortunately it is taking longer. Maybe 4 pages. Not ready to post. But the diagrams are finished, I think you are on the ball, and you can use it right away.

Warning: Experienced Project Construction Engineers Only
Road not suitable for caravans or readers with a high Eek factor

Link to ▶Four Alternative Data Models◀ in Document Under Construction. Apologies for the mess on the floor; I will clean up soon.

▶Link to IDEF1X Notation◀ for anyone who is unfamiliar with the Standard for modelling Relational databases.

  1. They are all Relational, with full integrity.

  2. The 6NF options. Relational today (SQL) does not provide support for 6NF; it does not disallow it, it just does not provide the 5NF➔6NF structures. Therefore you need to build a small catalogue, what some people call "metadata". Really, it is just an extension of the standard SQL catalogue (sys tables). The level of control required is modelled in each option.

  3. Essentially EAV done properly, with full control and integrity (type safety, Declarative Referential Integrity, etc) rather than the mess it usually is.

You may be interested in these related question/answers (in particular, look at the Data Models):

Multiple Fixed vs Abstract Flexible

Database Schema-Related Problem

"Simple" Database Design Problem

Response to Comments

... That way, we can easily grab "Comment" rows associated with a given specialized type instance. Is this the way to do that, or will I regret that decision later? Is there any other pattern we're missing?

Not sure what you mean. Comments, Notes, Addresses, end up being used (columns resident in) in many tables, so the correct method is to Normalise them; provide One Table for Comment; that is referenced from any table that requires it. Here is a generic Comment table. It is used in Product (the supertype) because you stated any Product. It can just as easily be used in some of the Product subtypes, and not others; in which case the FK will be in said Product Subtypes.

Your Data Model

What is the purpose of the ProductType table in your Product 5NF/subtype example? Does it contain a row corresponding to each specialized Product, e.g., ProductCPU? I assume it indicates which specialization the base product is.

(Small critical mistake in the diagram, corrected.)

Yes, exactly.

In Standard Relational terms (not the uncontrolled messes passing off as databases), the ProductType is the Discriminator; it identifies which of the Product Subtypes apply to this Product. Tells you which Product Subtype table you need to join with. The pair together make a logical Product. Do not forget to produce the Views, one for each ProductType.

  • (Do evaluate how ProductType changes, exactly what role it plays, for each of the four Data Models.)

  • "Generalisation-specialisation" is all mumbo jumbo, OO terminology; without crossing the line and learning what Relational has been capable of for 30 years. If you learn a little about Relational, you will have the full power; otherwise you are limited to the very limited OO approach to everything (Ambler and Fowler have a lot to answer for). Please read this post, from 11 Dec 10 onwards. Relational databases model Entities, not objects; not classes.

For example, when adding a new product you'll want to provide, say, a dropdown selection of which product types it is possible to add. Based on this selection, it can be deduced which tables to put the data in. Correct? I'm sorry for talking about application code, but I just need to put it into perspective

Yes. And what page (with fields) to provide next, for the user to enter data.

No problem talking about the app code that will use the Rdb, they go together like husband and wife (not husband and slave).

  • For your OO classes, map the Class tree to the Rdb, once you have finished modelling the Rdb, independent of any app that will use it. Not the other way around. And not dependent on one app.

  • Forget about "persisting", it has many problems (Lost Updates; damaged data integrity; problematic debugging; massive contention; etc). All updates to the Rdb should be in Transactions, with ACID compliance, available for 30 years, but Fowler and Ambler have not read about it yet. Usually that means one stored proc pre xact.

The discriminant is a FK to a Type-table as we established earlier. It denotes which spec. sub type the base type adheres to. But what does the discriminant table contain in detail?

Is that not clear from the data model ? ProducType CHAR(1) or (2). Name Char(30).

Could be a display-friendly text stating the type for UI-purposes,

Yes, among other things, such as the control, contraint, etc, elimination of ambiguity when coding or reporting.

but does it also contain the exact table name which contains the specialized type?

No. That would be a little too physical to be placed in data. Disallowed on principle.

But it is not necessary.

Say I'm interested in the Product with ID = 1. It has a discriminant indicating that it is a ProductCPU. How would you go about retrieving this ProductCPU from your app code?

That will be easy if you take the provided model, and implement it (all the tables) as classes, correctly, etc. The example you request will not use Views (which are for lists, and more generic use). The pseudo-code would be:

  • given the ProductId (Subtype unknown, therefore your should not be sitting a a Subtype-specific window), load the Product supertype only
  • based on the Discriminator Product.ProductType, set indicators, etc, and load the applicable subtype, one of ProductCPU; ProductMemory; ProductDisk; ProductTape; etc.

  • I have seen (and do not agree with) OO methods that load all subtypes for the given ProductId at once: one subtype is valid; and the rest are invalid. The code still has to constrain itself to the valid class for the Product based on Product.ProductType.

Alternately, eg. where the context is, the user is sitting in a Subtype-specific window, eg. ProductCPU, with that class set up, and requests ProductId xxx. Then use the ProductCPU View. If it returns zero rows, it does not exist.

  • There may be a ProductDisk xxx, but not a ProductCPU xxx. How you handle that, whether you indicate there is a Product`xxx but it isn't a CPU, or not, that depends on the app requirements.

For lists, where the app fills in a grid, without regard to the ProductId, use the views (one each) to load each grid. That SQL is based on the join, and does not need to refer to ProductType.

Community
  • 1
  • 1
PerformanceDBA
  • 32,198
  • 10
  • 64
  • 90
  • I'm reading your comments to the other replies now. So far so good :-) Am I completely wrong, going with the subtyping technique? As far as I'm concerned, it gives me the referential integrity and type safety I need. Comments are greatly appreciated. – Anders Dec 06 '10 at 06:59
  • @Anders. Answered in my post. – PerformanceDBA Dec 06 '10 at 15:28
  • I appreciate the time you've put into this. My apologies for not noticing this sooner, but I've been too busy with other things. I will take a look at this after lunch. – Anders Dec 20 '10 at 11:06
  • @Anders. Tweaked the drawings a little; updated the link. If you ask specific questions, I can take it from there. – PerformanceDBA Dec 21 '10 at 07:32
  • Thank you so much. I've run into another design issue. Until further analysis, we've settled with the subtyping, 5NF (full control). It suits the data we need to model. But now we need to associate a "Comment" with any of the specialized types. We'll probably make an FK from the Comment table which references a row in the base type table (Product in your example). That way, we can easily grab "Comment" rows associated with a given specialized type instance. Is this the way to do that, or will I regret that decision later? Is there any other pattern we're missing? – Anders Dec 21 '10 at 09:10
  • What is the purpose of the ProductType table in your Product 5NF/subtype example? Does it contain a row corresponding to each specialized Product, e.g., ProductCPU? I assume it indicates which specialization the base product is. For example, when adding a new product you'll want to provide, say, a dropdown selection of which product types it is possible to add. Based on this selection, it can be deduced which tables to put the data in. Correct? I'm sorry for talking about application code, but I just need to put it into perspective :-) – Anders Dec 21 '10 at 10:57
  • 1
    @Anders. My pleasure, and thanks. Answered in my post. Note the new (yours alone!) Data Model. – PerformanceDBA Dec 21 '10 at 11:40
  • Awesome :-) Back from the holidays, and I hope you had a good one. A minor thing, though.. The discriminant is a FK to a Type-table as we established earlier. It denotes which spec. type the base type adheres to. But what does the discriminant table contain in detail? Could be a display-friendly text stating the type for UI-purposes, but does it also contain the exact table name which contains the specialized type? Say I'm interested in the Product with ID = 1. It has a discriminant indicating that it is a ProductCPU. How would you go about retrieving this ProductCPU from your app code? – Anders Jan 04 '11 at 08:00
  • 1
    @Anders. My pleasure. Answered in my post. We don't know beforehand, what the ProductType for ProductId x is. – PerformanceDBA Jan 04 '11 at 19:20
1

I would go for the "create a table for the common properties and then a table for each of the specialized versions" method, personally.

Reason: you say that your implementation will be done in a RDBMS and this is non-negotiable. Fine. Dumping unstructured, blob-like stuff like a serialized hashtable in a DB field goes against the design philosophy of RDBMS though, so you will have a severe hit on efficiency unless you are ok with the idea of treating the *extended_properties* field as an opaque blob, just like a gif or another binary object.

In other words, forget querying (efficiently) for "all the objects having extended property COLOR=RED".

The problem you have (describing OO taxonomies in a RDBMS) is definitely not new. Have a look at this, for a in-depth description of the options.

p.marino
  • 6,244
  • 3
  • 25
  • 36
  • Good points. I failed to provide enough detail in my question regarding the RDBMS I'm using, which I've done now. I'll be using the MS SQL Server 2005, which has support for XML fields in tables. This means queries which involve my 'extended attributes' are able to take the contents of the XML tree into account as well. This supports indexing as well. – Anders Nov 29 '10 at 14:29
  • I don't have any direct experience with XML usage as part of a RDBMS, so personally I would be a bit wary (it also depends on how well this plays with 3rd party stuff like Crystal Reports etc, assuming this may be a concern). If this is something you know well (i.e. you can evaluate the drawbacks, if any exist) then go for it. – p.marino Nov 29 '10 at 15:01
  • Having discussed the more domain-specific aspects of the application with my colleagues, I think this is the way we will go. It is definately more important for us to preserve the referential integrity and other constraints at the cost of (very) little added maintenance. Thanks. – Anders Dec 01 '10 at 12:52
1

This is a classic example of the gen-spec design pattern. Gen-spec is covered in every tutorial on object modeling, because it is handled by inheritance. It's frequently skipped over in tutorials on relational data modeling. But it's well understood.

Do a web sreach on "generalization specialization relational modeling". You'll see several articles on how to set up a single table for the general class and a table for each specialized class. The articles will help you with foreign key design. In particular, the primary key of each specialized table does double duty. It's also a foreign key to the generalized table.

This won't look very familiar to you if you are used to object modeling. But you'll find it works well. And the solution offered by most of the articles is not dynamic, so you'll have to do some DDL every time a new specialized subclass is discovered.

Walter Mitty
  • 18,205
  • 2
  • 28
  • 58
  • Thank you for your thoughts. It was what @p.marino suggested too, and we're going with you guys' suggestion. – Anders Dec 01 '10 at 12:52
  • 1
    It isn't that "gen-spec" is skipped in Relational modelling, it is more accurate to say that we have Supertype-Subtpes, of two varieties. "gen-spec" is just the OO term. – PerformanceDBA Dec 25 '10 at 04:02
0

Here are five examples from the SO. Which one to use would depend on the actual problem you are solving and you preference.

In general I would advise against serializing data into DB fields.

If there are relatively few specialized versions of an entity you could simply use sub-typing like in these examples: example one, example two.

For large values of properties, or if the properties are to be defined dynamically (no schema changes), look into an implementation of the observation pattern like in these examples: example three, example four ; or into the so-called 6th normal form as described in: example five. Note that the "observation pattern" is a simplified version of the 6th NF.

Community
  • 1
  • 1
Damir Sudarevic
  • 21,891
  • 3
  • 47
  • 71
  • 1
    Thanks. Question. Since 6NF is the "Irreducible form", how does one "simplify" it ? I did look at your links; they are straight 5NF (only Trait is 6NF, and it looks incidental, not purposeful). – PerformanceDBA Dec 05 '10 at 14:13
  • Well, true -- "The Observation Pattern" is adopted from Martin Fowler's OO model (Analysis Patterns). The Measurement table plays both Numeric and Integer roles by adjusting the units -- for example numeric value of 1.52 meters maybe stored as integer 1520 with mm unit. Neat for ad-hock measurements -- sensors, automation etc. – Damir Sudarevic Dec 05 '10 at 15:16
  • @PerformanceDBA (... continued) -- So compared to your "Product Cluster" model, [Product Column] ~ [Property] ; [Column] ~ [Property Type] ; [Product Numeric]+[Product Integer] ~ [Measurement] ; [Product String] ~ [Trait] ; [Product] ~ [Subject]. There are no direct links from [Subject] to [Measurement] or[Trait]. So "simplified", but that may not be the right word to use here. – Damir Sudarevic Dec 05 '10 at 15:31
  • 1
    1) I am aware of all that. 2) A row is in 6NF if it consists of the PK plus one attribute only; it cannot be reduced (normalised) further. Two non-PK attributes fails 6NF. 3) Since you did not normalise to 6NF and then do something additional, "OO Oservation pattern is a Simplified form of 6NF" is completely false. 4) There is no comparison between the OO model and my model; the first models objects, I model data (Product is 5NF with the mandatory columns; the Optional columns are in 6NF tables). – PerformanceDBA Dec 06 '10 at 00:29
-1

This technique can have some issues with proper design and performance, but you seem to need a compromise with flexibility.

Instead of having to create specific tables or even a new field for each property you could have a table (And it will be quite large) for unique properties:

Unique_Property_ID
, FK_To_Some_Entity (Not sure what entity these link to: customers, bills, etc.) 
, Property_Type Not the data type but a link to a table describing this entity)
, Property_Value (Difficult to determine if all of your values will be of the same type: string, int, date, etc.)

Example: A used car dealership needing to track add ons for different makes and models (They never know what they're going to get). A few records may look like:

VehicleID   Property Type  Property Value
3           Sound System   Bose
3           Hybrid System  Battery
7           Starter Type   Hand Crank
7           Passenger Seat Rumble
9           Starter Type   Kick Start
9           Passenger Seat Side Car

Each value for each Set | property type would have its own record.

Another problem is when you do want each property to be represented as a column, you'll need to transpose this table.

JeffO
  • 7,957
  • 3
  • 44
  • 53
  • This solution seems to be a more relational idiomatic way of modelling what I want to achieve with my `extended_properties` field. Regarding the type safety, MS SQL Server 2005+ support casting at query-time. One could add a field `type` to the schema you suggest, and let the casting be done based on the value of this field. However, that is not what I consider 'type safe' or 'pretty' for that matter :-) – Anders Nov 29 '10 at 14:52
  • "Not sure what entity these link to: customers, bills, etc." is precisely why this approach is a bad idea. You are dropping referential integrity and making the whole design brittle. – p.marino Nov 29 '10 at 15:02
  • 1
    If it doesn't have RI, type safety, etc, it is not a database. "Transpose" ? Re-enter data ? Do you mean Pivot ? – PerformanceDBA Dec 05 '10 at 14:16
  • @PerformanceDBA - @p.marino - it's an EAV model and can be accomplished in a traditional relational database. http://en.wikipedia.org/wiki/Entity-attribute-value_model – JeffO Dec 06 '10 at 18:24
  • 2
    I am not interested in what some ever-changing, non-technical info of the web says, or what they call it this week; I am answering a Relational db question for Anders. If the structure you are proposing does not have DRI, type safety, etc, it is **not** a relational db. You an implement EAV **with** or **without** RI and type safety. Yours is without. – PerformanceDBA Dec 09 '10 at 21:58