4

I have to choose the structure of a database that will store content types (e.g. blog articles, pages, documents, invoices, estimates, etc.) with dynamic fields: for example, the Estimate content type should have the fields title, date and total price.

However, over time those fields can be added or removed, so after a year the Estimate content type might also have a notes field.

This is a common feature of well-known CMSs (Drupal, for example), but I wonder what the best approach is for performance and flexibility. Drupal, for instance, has one table with the basic fields (e.g. title), while all the secondary fields are stored in sub-tables created on the fly and linked to the main one with foreign keys:

table node
| id | title         | ...
|  1 | First example |
table fields_node_total_price
| id | node_id | value  |
|  1 | 1       | 123.45 |
table fields_node_date
| id | node_id | value    |
|  1 | 1       | 12345677 |

etc..

My view is that this approach is very flexible but easily runs into performance issues: in order to get all the fields of a document you have to join the tables many times, and the code itself has to iterate many times to build the query (though this shouldn't be a problem).
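For example, with the multi-table layout above, fetching a single Estimate with its two extra fields already needs one JOIN per field (a sketch; table and column names are the ones from the example tables):

SELECT n.id, n.title, p.value AS total_price, d.value AS estimate_date
FROM node n
LEFT JOIN fields_node_total_price p ON p.node_id = n.id
LEFT JOIN fields_node_date d ON d.node_id = n.id
WHERE n.id = 1;
-- every additional field (notes, ...) adds another JOIN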

By the way, multi-table is the most widely used approach, so the single-table approach must have many cons.

I'm wondering what kind of disadvantages using a single table would have:

| id | title | total_price | date | etc...

I did some tests with 5 and 50 additional fields; the performance difference between the single-table approach and the multi-table approach is enormous: the single table is about 50x faster.

Every time a field is added, a column is added to the table. What kind of problems would this approach raise?
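For clarity, adding the notes field mentioned above would just be a schema change (MySQL-style syntax; the table name and column type are only examples):

ALTER TABLE node ADD COLUMN notes VARCHAR(255) NULL;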

EDIT

Let me provide a few details:

  1. The application is still in the design phase; it is a complete redesign of an old application where the number of fields was static.
  2. We did a few tests simulating an object to store, with both the single-table and the multi-table approach (using 50 fields); the results are:

Time in seconds:

| Test                                                      | Run 1    | Run 2    | Run 3    | Run 4    | Run 5    | avg              |
| 1000 insert single_table                                  | 8,5687   | 8,6832   | 8,7143   | 8,7977   | 8,6906   | 8,69090137389466 |
| 1000 select single table LIKE '%key%' on char(250) field  | 1,5539   | 1,5540   | 1,5591   | 1,5602   | 1,5564   | 1,556705142      |
| 1000 select single table LIKE '%key%' on char(25) field   | 0,8848   | 0,8923   | 0,8894   | 0,8919   | 0,8888   | 0,889427996      |
| 1000 select single table id = $n                          | 0,2645   | 0,2620   | 0,2645   | 0,2632   | 0,2636   | 0,263564462      |
| 1000 select single table integer field < $j               | 0,8627   | 0,8759   | 0,8673   | 0,8713   | 0,8767   | 0,870787334      |
| 1000 insert multi_table                                   | 446,3830 | 445,2843 | 440,8151 | 436,6051 | 446,0302 | 443,023531816    |
| 1000 select multi table LIKE '%key%' on char(250) field   | 1,7048   | 1,6822   | 1,6817   | 1,7041   | 1,6840   | 1,691367196      |
| 1000 select multi table LIKE '%key%' on char(25) field    | 0,9391   | 0,9365   | 0,9382   | 0,9431   | 0,9408   | 0,939536426      |
| 1000 select multi table id = $n                           | 0,9336   | 0,9287   | 0,9349   | 0,9331   | 0,9428   | 0,93460784       |
| 1000 select multi table integer field < $j                | 2,3366   | 2,3260   | 2,3134   | 2,3342   | 2,3228   | 2,326600456      |
Strae
  • "you must query the database many times" - um, no. You'd construct a single query that joins to each of the tables that you want to retrieve data from. "the code itself have to iterate many times" - again, no, unless you're doing something odd. – Damien_The_Unbeliever Jan 10 '12 at 13:11
  • @Damien_The_Unbeliever you're right; by `query the db many times` I also meant constructing a query with many joins, which usually can result in slow performance – Strae Jan 10 '12 at 13:13
  • maybe this question was a better fit for http://dba.stackexchange.com/ ? – Stefano Jan 19 '12 at 08:58

4 Answers

6

It may be worthwhile investigating what is possible with NoSQL databases. I haven't used them much myself, but given you say you need to "...store content types (eg. Blog articles, Pages, Documents, Invoices, Estimates, etc..) with dynamic fields" it seems as though it may be a reasonable approach.

From the Wikipedia article;

...These data stores may not require fixed table schemas, usually avoid join operations, and typically scale horizontally.

and

Often, NoSQL databases are categorized according to the way they store the data and it falls under categories such as Key-Value stores, BigTable Implementations, Document-Store databases and Graph Database.

I'm not saying it is the answer to all your problems, but I'd certainly say it's worth a look.

With regards to other approaches, I've used Entity-Attribute-Value (EAV) in the past, and while the performance probably lags behind having a fixed schema, I feel it is a compromise that had to be made to afford the flexibility in the schema.

My situation is likely to differ from yours, but I'll lay it out for you in case it is any help. We broke the table structure into something that was logical for our situation. There is a bit of a natural hierarchy, in that there is a parent table which most of the other tables relate to.

Even though we needed dynamic structure due to the variety of the data we are dealing with, there was also some fixed structure. Therefore, for each table requiring dynamic structure, we created a "main" table, and an "attribute" table.

An example of this (SQL Server specific) can be seen below;

CREATE TABLE [dbo].[ParentTbl](
    [Id] [int] IDENTITY(1,1) NOT NULL,
    [KnownCol1] [real] NOT NULL,
        -- Lots of other columns omitted
    [KnownColn] [real] NULL
)        

CREATE TABLE [dbo].[MainTbl](
    [Id] [int] IDENTITY(1,1) NOT NULL,
    [ParentId] [int] NOT NULL, -- FK to ParentTbl.Id
    [KnownCol1] [real] NOT NULL,
        -- Lots of other columns omitted
    [KnownColn] [real] NULL
) 

CREATE TABLE [dbo].[MainTblAttr](
    [Id] [bigint] IDENTITY(1,1) NOT NULL, -- Note big int to cater for LOTS of records
    [MainId] [int] NOT NULL, --FK to MainTbl.Id
    [AttributeColumn] [nvarchar](255) NOT NULL,
    [AttributeValue] [nvarchar](max) NOT NULL
)

You can then perform a PIVOT query to help get your data out. Given that you will have different attributes, you need to determine which columns to include in the pivot. I found this example to be invaluable when I was developing my solution, but there are loads of examples on SO; just search for "pivot dynamic columns".
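As a rough sketch (not the linked example), a dynamic PIVOT over the MainTblAttr table above could look something like this; the CAST is only a precaution because PIVOT needs an aggregate over the value column:

-- 1. Build the pivot column list from the attribute names currently in use.
DECLARE @cols nvarchar(max), @sql nvarchar(max);

SELECT @cols = STUFF((
    SELECT DISTINCT ',' + QUOTENAME(AttributeColumn)
    FROM dbo.MainTblAttr
    FOR XML PATH('')), 1, 1, '');

-- 2. Pivot the attribute rows into one column per attribute, one row per MainTbl record.
SET @sql = N'
SELECT MainId, ' + @cols + N'
FROM (
    SELECT MainId, AttributeColumn,
           CAST(AttributeValue AS nvarchar(4000)) AS AttributeValue
    FROM dbo.MainTblAttr
) AS src
PIVOT (
    MAX(AttributeValue) FOR AttributeColumn IN (' + @cols + N')
) AS pvt;';

EXEC sp_executesql @sql;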

In my instance, having a parent table is a big help in limiting the amount of data I need to trawl through as it limits the child records that I need to look at. This might not be so in your case, but hopefully this will give you some ideas.

Best of luck.

Mr Moose
  • I would stay away from NoSQL unless you KNOW you need them. Unless you are Google, Amazon, etc., I would think not. The big strength of NoSQL databases is that they can cope with far larger data volumes than traditional relational databases. This, however, comes at the expense of the convenience of a relational database: e.g. no declarative query language, fewer guarantees regarding data consistency, etc. In short, they can be incredibly efficient when dealing with massive amounts of data but are awkward to use in comparison. To leverage the efficiency you need a decent-sized cluster to run them. – ARF Jan 19 '12 at 12:18
3

There is no single "right" answer to this question. As you have already mentioned, it comes down to a trade-off between flexibility and speed.

It depends on what the bottleneck in your application is. Have you done some profiling on your application? Is database query time even relevant compared to typical end-user ping times, transfer speeds, etc.? There is really no point worrying about optimizing performance until you are sure that you actually have a performance problem AND know where the bottleneck is!

I like to use Firebug in Firefox to time how long my page takes to display for the end user, and compare this to the result of a stopwatch timer started before the query and stopped after it. For ease of use, I print it at the bottom of every page during profiling.

Have you considered views to offset the disadvantages of the multi-table approach?

Regarding the complicated-query problem: with "virtual" views you can avoid having to write complicated joins in your everyday queries. You put the joins in the view definition, and when the dynamic fields change you only have to adjust the view. (Note: with virtual views your "simple" query is dynamically rewritten using the joins from the view definition.)
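For example, a sketch using the node / fields_node_* tables from the question (the view name is made up):

CREATE VIEW estimate_node AS
SELECT n.id,
       n.title,
       p.value AS total_price,
       d.value AS estimate_date
FROM node n
LEFT JOIN fields_node_total_price p ON p.node_id = n.id
LEFT JOIN fields_node_date d ON d.node_id = n.id;

-- everyday queries stay simple; adding or removing a field only touches the view definition
SELECT id, title, total_price FROM estimate_node WHERE id = 1;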

Regarding the speed problem: you could use "materialized" view definitions with the multi-table approach to get single-table performance. With a materialized view the DBMS creates a physical table from the view definition, i.e. it executes the joins and stores the result. You are then truly querying a "single table", which is kept in sync with your multi-table definition (automatically or via an explicit refresh, depending on the DBMS). You get the best of both worlds at the expense of DB storage space.
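A sketch of the materialized variant (PostgreSQL 9.3+ syntax; MySQL has no built-in materialized views, so there you would maintain such a table yourself):

CREATE MATERIALIZED VIEW estimate_node_mat AS
SELECT * FROM estimate_node;   -- same definition as the virtual view above

-- refresh when the underlying field tables change, or on a schedule
REFRESH MATERIALIZED VIEW estimate_node_mat;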

Depending on your DBMS, you may also be able to update the views directly (rather than the underlying tables). I believe this is the case with MySQL. With Postgres you need to use triggers to tell the system how to modify the underlying tables.

In Summary:

  1. Personally, if I wanted to create a system to last, I would go with the multi-table approach with virtual views. I would then "materialize" only those views whose performance turns out to be lacking. This is more effort to get off the ground running at single-table speed, but it will remain incredibly flexible.
  2. If I wanted something quick and dirty yet fast, I would go with single-table. But it may occasionally be a pain in the butt to incorporate some of the changes. I do not see a problem arising from having a large number of columns; any relational DBMS should be OK.
  3. If I wanted something quick and dirty but flexible, I would go with multi-table and not worry about defining views & triggers but only define some indexes to speed up the join operations.

Last point: you should really try to do as much data processing as possible in the DBMS (i.e. in the query). You have already realized that "the code itself have to iterate many times to build the query" is not true (see views, etc.). However, it suggests that you have a tendency to do too much data processing in your application. SQL is incredibly expressive, and your DB will most likely use far more efficient algorithms for your data processing than anything you are likely to implement yourself. Note: a SQL query that looks incredibly complex may in fact run very fast!

Thus if you are doing anything other than just looping through your query result to display a web-page, you can probably still put more of the logic into your query.

ARF
  • `Have you done some profiling on your application?` Mmh, no, the whole application (db, server, code) is still in the design phase. We did a few tests simulating the single and multi table structures with fake data, and the results show that single-table is 4997% faster than multi-table on inserts, 8% faster with `select ... LIKE '%key%'` on a big field, 5% faster with `select ... LIKE '%key%'` on a small field, 254% faster with `select ... id = X` and 167% faster with `select ... integer_field < y`. – Strae Jan 19 '12 at 10:36
  • It may well be that the query times will be insignificant in the grand scheme of things: e.g. so what if your query takes only 0.01 ms instead of 1 ms to execute if the web page you are displaying to your user with the result takes the average PC 500 ms to render? Your time would have been better spent trying to clean up your html templates. - I would go with the implementation that is most convenient to use. If it turns out to really be too slow, it should not be too difficult to change the DB backbone structure later provided you use some abstraction layers in your coding (e.g. views). – ARF Jan 19 '12 at 12:11
  • +1 for `Note: a SQL query that looks incredibly complex may in fact run very fast!` Also, overall I liked your answer, especially the `materialized` part. – Junaid Aug 06 '12 at 11:17
0

The first solution is an "entity-attribute-value" database: see Entity Attribute Value Database vs. strict Relational Model Ecommerce.

I'd go for the latter solution: databases are made to store data, not structure! We are having major problems because we have an entity-attribute-value database in which we can insert any kind of data, but it's impossible to query it or to target a specific piece of data without magic strings.

Or you can go for another solution: store your additional fields in a serialized version of your AdditionalFields object.
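A minimal sketch of that idea (table and column names are made up; the blob could be JSON, XML, the output of your language's serializer, etc.):

CREATE TABLE node_serialized (
    id           INT PRIMARY KEY,
    title        VARCHAR(255) NOT NULL,
    extra_fields TEXT  -- e.g. '{"total_price": 123.45, "date": "2012-01-10", "notes": "..."}'
);

-- easy to store and read back as a whole...
SELECT id, title, extra_fields FROM node_serialized WHERE id = 1;
-- ...but the DBMS cannot index, filter or join on the individual serialized fields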

remi bourgarel
0

In large systems (50+ columns, using replication with 5+ hosts) the load caused by updating a single row in a table increases as you add columns, because the whole row has to be replicated. This effect can be reduced by splitting a big table into multiple parts. With proper indexes this comes at nearly no cost for analytic workloads, although it impairs the performance of inserts.
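A sketch of that kind of vertical split (hypothetical names; the point is that updating a row only touches, and therefore only replicates, the narrow table it lives in):

CREATE TABLE node_hot (
    id    INT PRIMARY KEY,
    title VARCHAR(255) NOT NULL      -- frequently updated columns
);

CREATE TABLE node_cold (
    id    INT PRIMARY KEY,           -- same id as node_hot (1:1)
    notes TEXT                       -- rarely updated columns
);

-- analytic reads join the parts back together on the shared key
SELECT h.id, h.title, c.notes
FROM node_hot h
JOIN node_cold c ON c.id = h.id;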

mschneider