1

I'm aware this is normally a bad idea, and I've done my reading - in particular, this question.

However the total normalisation route seems more complex and will give me and my code more hoops to jump through. Here's my scenario:

I'm building a test creation system where users can create tests, questions and answers, and associate them all together, i.e. associate answers with questions, and questions with tests. This approach means there's no hard-linking any one kind of data to any other; a given question can be part of two or more tests, for example. So, I was thinking (simplified):

Tests table:

  • id (PK)
  • name (varchar)
  • questions (com-sep list of question IDs)

Questions table:

  • id (PK)
  • question text (varchar)
  • answers (com-sep list of answer IDs)

Answers table:

  • id (PK)
  • answer text (varchar)

So a given row in the tests table might look like:

---------------------------------------
| ID | NAME           | QUESTIONS     |
---------------------------------------
| 1  | SOME TEST      | 1,4,7,8,11,19 |
---------------------------------------

Then, when I fetch a test and its questions, I just do some magic with group concat.

Question: is this all a bad idea? It seems a lot simpler than the alternative which is to have two further tables dedicated, respectively, to logging associations between tests and questions, and questions and answers, meaning more tables involved in any queries.

Community
  • 1
  • 1
Mitya
  • 33,629
  • 9
  • 60
  • 107
  • 1
    Me personally, I would prefer the two additional tables. As your dataset grows the query run time for your design would be longer due to the fact you have to parse the `questions` and `answers` fields. A join to a sub table would be much faster and easier to manage. – Linger Jun 13 '14 at 18:32

2 Answers2

2

Yes, it probably is a bad idea.

Why do you think of having two more whole tables (whoa!!) as a big deal? It really isn't.

Anyway, if you're really definitely never going to want to do something like "find out which tests question 3 appears in" then go nuts, but the moment you do find you have to do something like that you'll wish you had just done it the right way.

And how will you make sure that your data is even halfway sensible? If 564 appears as an entry in one of your comma-separated lists, will you be sure that there is definitely a question number 564 in the Questions table, that it hasn't been deleted since? What a lot of extra complication to avoid creating two tables. If you don't like typing the SQL to perform the joins, you could just use an ORM.

Hammerite
  • 21,755
  • 6
  • 70
  • 91
  • I've no issue with joins. Re: 564, sure, but that works for tables, too; I could delete question 564 and omit to mention any records in the associations table for that question. Re: whoa!!, I didn't say it was a big deal, I said it was something to consider. Since it constitutes a 66.6% increase in the number of tables, it's hardly imprudent to consider the point. – Mitya Jun 13 '14 at 19:04
  • 1
    You really want to be using foreign key constraints to make sure that if question 564 belongs to one or more tests, either 1) it can't be deleted until it's removed from all of those tests (`ON DELETE RESTRICT`) or 2) if it gets deleted then it gets automatically removed from those tests (`ON DELETE CASCADE`). You can't do that using comma-separated storage of ids. Well, actually, maybe you could (very tediously and inefficiently) using triggers, but why would you do that to yourself? – Hammerite Jun 13 '14 at 19:14
  • 2 more tables or a 67% increase in tables, however you want to frame it, you shouldn't care about that small difference. – Hammerite Jun 13 '14 at 19:15
  • Whoa, `ON DELETE RESTRICT` and `ON DELETE CASCADE` are new concepts to me. +1 – Mitya Jun 13 '14 at 19:49
1

Sure, there are cases when denormalization is worthwhile.

But keep in mind that denormalization helps simplify a subset of queries against your data, at the expense of all other queries.

The scenarios listed in my answer to Is storing a delimited list in a database column really that bad? show how many other types of queries or updates you might have to do against your data. Searching, sorting, inserting, deleting... Also, relying on referential integrity to avoid your data turning to a collection of orphans.

But if you know that fetching or updating the whole list of id's is the only thing you need to optimize for, and this will never change (famous last words), then go for it, use denormalization.

If you want any of those other types of queries to be convenient or efficient, stick with a normalized design.

Community
  • 1
  • 1
Bill Karwin
  • 538,548
  • 86
  • 673
  • 828