Given a site like StackOverflow, would it be better to create num_comments column to store how many comments a submission has and then update it when a comment is made or just query the number of rows with the COUNT function? It seems like the latter would be more readable and elegant but the former would be more efficient. What does SO think?
-
1If SO needed to join tables to show comment counts, it wouldn't exist. But worrying what will happen when your site is hit by millions of pageviews per day is, let's just say, premature -- so in your own projects, go with the `COUNT`. – Jon Mar 17 '11 at 21:01
-
5Don't prematurely optimize. Keep databases normalised until you need to denormalise them. – Quentin Mar 17 '11 at 21:02
-
1@Jon: interesting...could you elaborate or provide a link ? I am nuts outside the relational model, but always ready to learn... – iDevlop Mar 17 '11 at 21:03
-
@iDevlop: Submitted an answer, have a look. – Jon Mar 17 '11 at 21:15
5 Answers
Definitely to use COUNT. Storing the number of comments is a classic de-normalization that produces headaches. It's slightly more efficient for retrieval but makes inserts much more expensive: each new comment requires not only an insert into the comments table, but a write lock on the row containing the comment count.

- 232,168
- 48
- 399
- 521
-
1it's not denormalisation more an optimisation and requires a few triggers - hardly a headache ! – Jon Black Mar 17 '11 at 22:04
-
@JonBlack - Yes, it's an optimization (although, as I said in my answer, whether the "optimization" is worthwhile is not so clear). At the same time, it most definitely is a denormalization. Specifically, a `num_comments` column violates third normal form because it introduces a non-key dependency -- a value that is not dependent on the key but, in this case on values that most likely come from a completely different table. As for headaches, the issue is not just writing "a few triggers", but also having to maintain the triggers along with everything else as the db evolves. – Ted Hopp Oct 31 '17 at 01:09
The former is not normalized but will produce better performance (assuming many more reads than writes).
The latter is more normalized, but will require more resources and hence be less performant.
Which is better boils down to application requirements.

- 489,969
- 99
- 883
- 1,009
I would suggest counting comment records. Although the other method would be faster it lends to a cleaner database. Adding a count column would be a sort of data duplication not to mention require on additional code step and insert.
If you were to expect millions of comments, then you may want to pick the count column approach.

- 4,913
- 2
- 33
- 43
I agree with @Oded. It depends on the app requirements and also how active is the site, however here is also my two cents
- I would try to avoid the writes which will have to be done by triggers, UPDATES to post table when new comments are added.
- If you are concerned about reporting the data then don't do that on a transactional system. Create a reporting DB and update that periodically.

- 18,402
- 29
- 102
- 147
The "correct" way to design is to use another table, join it and COUNT
. This is consistent with what database normalization teaches.
The problem with normalization is that it cannot scale. There are only so many ways to skin a cat, so if you have millions of queries per day and a lot of them involve table X, the database performance is going below ground as the server also has to deal with concurrent writes, transactions, etc.
To deal with this problem, a common practice is sharding. Sharding has the side effect that the rows of a table are not stored in the same physical location, and a primary consequence of this is that you cannot JOIN
anymore; how can you JOIN
against half a table and receive meaningful results? And obviously, trying to JOIN
against all partitions of a table and merge the results is going to be worse than the disease.
So you see that not only the alternative you examine is used in practice to achieve high performance, but also that there are even more radical steps that engineers can and do take.
Of course, unless you do have performance issues, sharding or even de-normalizing is just making your life harder for no tangible benefit.