
Searched and searched. Not sure how to use Explain/Analyze to answer this, without constructing really large test tables and I don't have the means or time to pursue that. Certainly someone can confidently answer this likely simple question for me and save me hours of testing to find out.

I have a table which looks something like this:

id | destination_id | key | value | json_profile_data | deleted_bool | deleted_timestamp

The key and value were the original use of the table, but we recently began storing json arrays instead and now the key/value fields are unused. I want to add 3 new bits of data to this record id. My instinct is to make new columns in each row for the 3 new fields, but my associate wants to use the key/value cols to add the information using the same destination_id.

MY proposal means fewer rows in the table and looks like this:

id | destination_id | key | value | json_profile_data | claim_code | claim_date | claim_approved_bool | deleted_bool | deleted_timestamp

HIS solution is to add new rows, using the key/value cols to insert the three new bits of info with the same destination_id as their parent row on these new rows.

id | destination_id | null                | null  | json_profile_data | deleted_bool | deleted_timestamp
id | destination_id | claim_code          | value | null              | deleted_bool | deleted_timestamp
id | destination_id | claim_date          | value | null              | deleted_bool | deleted_timestamp
id | destination_id | claim_approved_bool | value | null              | deleted_bool | deleted_timestamp

His solution makes 4 rows per destination_id, mine makes 3 new columns on existing row for a given destination_id.
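
For illustration, the two variants could be sketched like this (the table name and data types are assumptions, since the actual definition isn't shown):

```sql
-- Variant 1 (new columns): one row per destination_id.
-- "destination_profile" is a hypothetical table name.
ALTER TABLE destination_profile
  ADD COLUMN claim_code          text,
  ADD COLUMN claim_date          timestamptz,
  ADD COLUMN claim_approved_bool boolean;

-- Variant 2 (key/value rows): three extra rows per destination_id,
-- everything stored as text in the generic value column.
INSERT INTO destination_profile (destination_id, key, value)
VALUES (42, 'claim_code',          'ABC123'),
       (42, 'claim_date',          '2014-08-11'),
       (42, 'claim_approved_bool', 'true');
```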

Which is more performant for selects against this table? Or does it matter? I hope I have written this in a way where it's clear. Let me know if more elaboration is needed.

  • Your solution is more *normalized*, but the question of *performance* is so specific to your particular data and use case that I don't know if a confident answer can be given here. Sometimes the best way to get faster reads or something is to de-normalize data structures. – willoller Aug 11 '14 at 20:44
  • this is the denormalized vs normalized debate...I'm sure you can find 20 years worth of reading on the topic using google. His method cannot be indexed...if you are searching using any of the three columns you have added, your method allows them to be indexed, his does not. On the other hand, if these three columns are very sparsely populated, his method may be better. It's a long debate ^^ – Twelfth Aug 11 '14 at 20:49
  • As with most things database, the answer is _"it depends"_, in particular, on what resultset needs to be returned, and what predicates are specified and what indexes are available. With that said, adding columns to the table would likely give better performance than adding rows. A more important issue, I think, is the design of the insert/update/delete operations. The original table looks like an implementation of an EAV model. To stick with the EAV model, we'd add rows. If improved performance is the goal, we'd avoid EAV model entirely, and just store attributes as columns. – spencer7593 Aug 11 '14 at 20:50
  • Thanks for your inputs. Since you all commented I can't upvote or select any of these answers, but I learned what I need to know, which is that in our case, being able to index is the key to my decision and that "I win the office debate", more or less. ;) Thank you all! – ncgoldminer Aug 11 '14 at 21:02
  • @ncgoldminer: I don't think "being able to index" is really the key argument in the debate; the crux of the issue is whether you want to run a query with predicates `AND claim_code='foo' AND claim_date>NOW() - INTERVAL 365 DAY`. That's simple with traditional relational model, and much more complicated with EAV. With EAV (adding rows) we would certainly need to have index(es) available to have any shot at reasonable performance. – spencer7593 Aug 11 '14 at 21:12
  • IMHO, the problem with EAV is not EAV *per se*, but that it allows you to run into 4NF / 5NF violations blindly. DBMSses tend to handle relations with 3 sets of candidate keys badly. "Solving" the problem by "denormalisation" will run into 1NF violations. – wildplasser Aug 11 '14 at 21:51
  • Such a question *must* provide a proper table definition, showing data types (`\d tbl` in psql). And *always* your version of Postgres, even if we might not depend on it here. – Erwin Brandstetter Aug 11 '14 at 22:55
  • Why the [mysql] tag? The title addresses Postgres, and there is no reference to MySQL ... – Erwin Brandstetter Aug 12 '14 at 00:02
  • removed the mysql tag, very obviously postgres – Twelfth Aug 12 '14 at 18:32
  • My bad on the tag. My instinct was that this was a generic question. I realize now that it isn't. – ncgoldminer Aug 12 '14 at 19:12

3 Answers


As with most things database, the answer is "it depends". In particular, it mostly depends on what result set needs to be returned, what predicates are specified, what indexes are available, the cardinality of the data, and so on.

With that said, in general, adding columns to the table would likely give better performance than adding rows.

A more important issue (I think) is the design of the insert/update/delete operations.

The original table looks like an implementation of an EAV (Entity Attribute Value) model; queries against EAV can get notoriously complicated when the results need to be "pivoted", and returned in a different format; or when we have predicates on multiple attributes.

To stick with the EAV model, we'd add rows to the table, and grind through the more complicated SQL that's required to work with that.

But if improved performance is the goal, we'd probably avoid EAV model entirely, and just store attributes as columns. That's the traditional relational database model: each row represents an "entity" (i.e. person, place, thing, concept or event that can be uniquely identified and we need to store information about), and each column represents an "attribute", a piece of information about the entity.
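
To illustrate the difference (table and column names assumed from the question), a query with predicates on two of the new attributes is a plain WHERE clause in the column design, but needs a self-join in the EAV design:

```sql
-- Columns as attributes: one simple predicate per attribute
SELECT destination_id
FROM   destination_profile
WHERE  claim_code = 'ABC123'
AND    claim_approved_bool;

-- EAV: the same question needs two copies of the table joined together
SELECT c.destination_id
FROM   destination_profile c
JOIN   destination_profile a USING (destination_id)
WHERE  c.key = 'claim_code'          AND c.value = 'ABC123'
AND    a.key = 'claim_approved_bool' AND a.value = 'true';
```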

spencer7593
  • It's true this is a hybrid of EAV and non EAV. As noted in my other comments above, I think we need to go with adding the cols to allow for indexing, which will in turn maximize the efficiency of this table's intended usage and data content. Cheers! – ncgoldminer Aug 11 '14 at 21:05

As you said, you'd have to try this with realistic volumes of data to see it empirically, but there's no question that the 'added columns' approach will be more performant. The other method will require four joins, which will almost certainly slow things down.
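
One way to fold the EAV rows back into a single row per `destination_id` without explicit self-joins is conditional aggregation, a standard pivot technique (names assumed, values come back as text):

```sql
-- Pivot the key/value rows back into one row per destination_id
SELECT destination_id,
       max(CASE WHEN key = 'claim_code'          THEN value END) AS claim_code,
       max(CASE WHEN key = 'claim_date'          THEN value END) AS claim_date,
       max(CASE WHEN key = 'claim_approved_bool' THEN value END) AS claim_approved_bool
FROM   destination_profile
GROUP  BY destination_id;
```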

GreyBeardedGeek
  • Doesn't require the 4 joins using Postgres pivot syntax. The slower component comes in when he discovers he can't index any of the rows in the other method. – Twelfth Aug 11 '14 at 20:50
  • By joins wouldn't you mean 4 where clause specifiers? Anyway, I think I got the answer in the main comment thread, which is that "my way" allows for indexing, which will benefit us in our situation with our data. Thanks! – ncgoldminer Aug 11 '14 at 21:04

Your associate is suggesting EAV storage. Ample details in this related question on dba.SE:

The rest is for Postgres, and only applicable to MySQL in parts.
You already have a json column, which is the obvious third solution to the problem, but neither of you seems to consider that. You could even just add the new attributes to the existing json column in place (that's not what I would normally do, though). If you go that route, consider the new jsonb type in the upcoming Postgres 9.4.

However, as long as we are talking about just those three columns (and not another new column every n weeks), your approach wins the performance bet in almost every possible aspect, and by a long shot, too. Additional columns are much cheaper, even if they are NULL most of the time, since NULL storage is very cheap:

Storage size is a major contributor to performance.

Either method can be indexed. For EAV storage you can employ partial indexes. To optimize this, one needs to know typical queries, access patterns, requirements and priorities. Your approach is typically simpler to manage.
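
A sketch of both indexing strategies (names assumed): an ordinary index per column in one design, versus one partial index per attribute key in the EAV design, each covering only the rows that actually carry that key:

```sql
-- Added-columns design: ordinary per-column indexes
CREATE INDEX ON destination_profile (claim_code);

-- EAV design: one partial index per attribute key,
-- containing only the rows where that key is present
CREATE INDEX ON destination_profile (value)
WHERE key = 'claim_code';
CREATE INDEX ON destination_profile (value)
WHERE key = 'claim_date';
```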

The obvious aspects where your approach would lose:

  • If there is a variable (yet unknown) number of new columns you need to add on the fly. That's much simpler with the EAV approach.

  • If you have lots of updates to only (one of) the new columns. That's cheaper with small separate rows.

Recent related answer discussing many columns in a table, with code for cross-tabulation, often needed for EAV storage:

Erwin Brandstetter
  • `Storage size is a major contributor to performance.` An additional way to circumvent this is to store additional *columns* (or groups of them) in additional *tables* and left-join these to the base table. ("a poor man's column-based DBMS") Tables are relatively cheap, and so are 1::0..1 joins, given correct PK/FK. (This kind of scheme can be automated by dynamic SQL.) – wildplasser Aug 11 '14 at 23:55
  • @Erwin Brandstetter Thank you for the thoughtful and elaborate reply. We did discuss adding the cols to the json object, but wish to keep them separate because the json contains profile info that the user manages. The new columns are behind-the-scenes systemic controls/flag type info. We didn't want to mix user-generated data with system-generated data and ruled that approach out for that reason. We're your 2nd bullet: lots of json row updates, just one-time inserts to the "flag" type rows, but lots of selects with them in where clauses. There are no more flag fields expected. Thanks a LOT! – ncgoldminer Aug 12 '14 at 15:03
  • @wildplasser - joins may indeed be the way to go, but we've gone with the additional cols idea originally described in my first post for now. We'll see how it goes. I'm grateful to everyone who answered, but I don't have enough status here yet to upvote any/everyone. Please everyone just accept my gratitude for being awesome Internet heroes who help strangers wade through things. :) – ncgoldminer Aug 12 '14 at 15:04