3

I'm trying to build a activity stream which has the following structure :

------------------------------------------------------------------------------------
id | activity_by_user_id | activity_by_username | ... other activity related columns
------------------------------------------------------------------------------------

Is this a good approach to store the activity_by_username too in the activity table ? I understand that this will clutter up the table with the same username again and again. But If not, I will have to do a join with the users table to fetch the username.

The username in my web application never changes.

With this, I will no longer have to join this table with the users table. Is this an optimum way of achieving what I need ?

YD8877
  • 10,401
  • 20
  • 64
  • 92
  • you should not denormalize like this 'just to avoid a join' what other important reasons do you have for this idea? have you measured any performance degradation? – Randy Jun 27 '14 at 18:45
  • I have a join of about 10 tables already. Just trying to avoid an additional join. – YD8877 Jun 27 '14 at 18:46
  • +1 . . . I upvoted the question because I think this is a valid question to ask on Stack Overflow, so it does not deserve a downvote. – Gordon Linoff Jun 27 '14 at 18:50
  • Valid question, but never asked before? – Marcus Adams Jun 27 '14 at 18:57

4 Answers4

2

What you are proposing is to denormalize the data structure. There are advantages and disadvantages to this approach.

Clearly, you think that performance will be an advantage, because you will not need to look up the username on each row. This may not be true. The lookup should be on the primary key of the table and should be quite fast. There are even situations where storing the redundant information could slow down the query. This occurs when the field size is large and there are many apps with the same user. Then you are wasting lots of storage on redundant data, increasing the size of the table. Normally, though, you would expect to see a modest -- very modest -- improvement in performance.

Balanced against that is the fact that you are storing redundant data. So, if the user name were updated, then you would have to change lots of rows with the new information.

On balance, I would only advise you to go with such an approach if you tested it on real data in your environment and the performance improvement is worth it. I am skeptical that you would see much improvement, but the proof is in the pudding.

By the way, there are cases where denormalized data structures are needed to support applications. I don't think that looking up a field using a primary key is likely to be one of them.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • Agreed. Denormalizing is a valid (and standard) approach in various situations, and PK lookups usually don't fit. Where the number is both small and unchanging, it probably won't have a significant impact either way. But if the values **ever** change, it can be a real mess. – user2338816 Jun 28 '14 at 00:00
2

There isn't a single answer to your question*

In general, relational database design seeks to avoid redundancy to limit the opportunities for data anomalies. For example, you now have the chance that two given rows might contain the same user id but different user names. Which one is correct? How do you prevent such discrepancies?

On the other hand, denormalization by storing certain columns redundantly is sometimes justified. You're right that you avoid doing a join because of that. But now it's your responsibility to make sure data anomalies don't creep in.

And was it really worth it? In MySQL, doing a join to look up a related row by its primary key is pretty efficient (you see this as a join type "eq_ref" in EXPLAIN). I wouldn't try to solve that problem until you can prove it's a bottleneck.

Basically, denormalization optimizes one type of query, at the expense of other types of queries. The extra work you do to prevent, detect, and correct data anomalies may be greater than any efficiency you gain by avoiding the join in this case. Or if usernames were to change sometimes, you'd have to change them in two places now (I know you said usernames don't change in your app).

The point is it depends entirely on your how frequently different queries are run by your application, so it's not something anyone can answer for you.


* That might explain why some people are downvoting your question -- some people in StackOverflow seem to have a rather strict idea about what is a "valid" question. I have seen questions closed or even deleted because they are too subjective and opinion-based. But I have also seen questions deleted because the answer is too "obvious". One of my answers with 100 upvotes was lost because a moderator thought the question of "Do I really need version control if I work solo?" was invalid. Go figure. I copied that one to my blog here.

Community
  • 1
  • 1
Bill Karwin
  • 538,548
  • 86
  • 673
  • 828
1

I think it is bad idea. Databases are optimized for joins (assuming you did your job and indexed correctly) and denormalized data is notoriously hard to maintain. There may be no username changes now but can you guarantee that for the future, no. Risking your data integrity on such a thing is short-sighted at best.

Only denormalize in rare cases where there is an existing performance problem and other optimitization techniques have failed to improve the situation. Denormalizing isn't even always going to get you any performance improvement. As the tables get wider, it may even slow down performance. So don't do it unless you havea measuable performance problem and you measure and ensure the denormlaization actually helps. It is the last optimation technique to try out of all of them, so if you haven't gone through all the optimation techniques in the very large list of possibilities, first, then denormalization should not be an option.

HLGEM
  • 94,695
  • 15
  • 113
  • 186
1

No. This goes against all principles of data normalization.

And it won't even be that difficult (if I'm interpreting what you mean by id, user_id, and user_name); id will be the primary key tying everything together - and the linchpin of your JOINs. So you'll have your main table with id, other activity col, next activity col, etc. (not sure what you mean by activity). Then a 2nd table with just id and user_id and a third with id and username). And when you want to output whatever you're going to output, and do it by user_id or username, you'll just JOIN (implied join syntax - WHERE table1.id = table2.id).

user3741598
  • 297
  • 1
  • 12