0

Lets say you have 20 options you want an end user to choose from. Based on those choices, calculations will be done on them to create a percentage. What would make better sense in terms of query performance when running query calculations?

a) Build one column and have all the selected values comma separated within that column.
b) Have each option as its own column in the table

c) Have a linking table that marks the choices back to the users ID?

If this should be asked somewhere else, let me know and I will delete this and move it to the appropriate location.

user3241191
  • 505
  • 2
  • 5
  • 12

4 Answers4

1

It all depends and I'm assuming we are talking about a sql data store.

If you need to be able to quickly query on things like which users have selected which option then then you'll either want separate columns in your table or another table with, say three columns.

userId
optionId
optionValue

optionId would identify the option ('option1', 'option2', etc.). This approach has the advantage of allowing you to add option types without modifying your database schema. This might be an advantage depending on your environment (like if adding a database column requires a lengthy change control process).

Likely the separate table approach would be somewhat slower, but I don't think this is a "difference that makes a difference". I would almost always use the representation that was the easiest to work with for my application logic,

If you really are interested in only the final percentage and never need to query on the options, then I think a comma separated list of strings is probably fine. a DBA might disagree with that opinion :).

Robert Moskal
  • 21,737
  • 8
  • 62
  • 86
  • Normally I have gone with the linking table approach as to always make my designs as relational as possible. Though this would be kind of a hybrid reporting/relational type storage. I know in reporting type storage, things are less linked for better retrieval times. "This approach has the advantage of allowing you to add option types without modifying your database schema". This is the normal approach I take, as I usual try to build things that build themselves but I also do not want to sacrifice performance. – user3241191 Feb 26 '14 at 17:18
1

a) probably violates the principle of atomicity and therefore the 1NF, leading to multitude of problems.

The choice between b) and c) depends on how dynamic the values should be:

  • c) is more appropriate if you want to be able to dynamically "grow" the set of permitted values,
  • while b) may be more straightforward (and performant) for the static case.

--- UPDATE ---

OK, let me touch the query performance aspect a bit.

By using b) or c), you can do AVG, SUM and other aggregate functions directly in the DBMS, then return the result alone. That conserves network bandwidth, which tends to be scarcer resource than CPU.

In case of a), you'd probably implement it by fetching all the "fat" data to the client and doing the calculations there (unpacking and aggregating data on the server would require some acrobatics - not impossible but certainly more fragile and less maintainable). And as Jonathan Van Matre rightfully pointed out, serializing into a string, as opposed to using the native binary storage format, is not exactly the most compact way of representing the data to begin with.

Community
  • 1
  • 1
Branko Dimitrijevic
  • 50,809
  • 10
  • 93
  • 167
  • a or c would be more of the approaches I would take. A would more go on the lines of reporting style design where as c would conform more to the relational style, which is the one I normally go with due to it being more scalable. – user3241191 Feb 26 '14 at 17:20
  • Well, unless you you guarantee you **never** need to access (read or write) individual options in isolation, a) violates the 1NF and the link I provided lists the clear disadvantages of that approach. It doesn't really matter how quickly you get an answer, if the data is corrupt so you get the wrong answer. Bottom line, I would be **very** weary of a). – Branko Dimitrijevic Feb 26 '14 at 17:26
  • Like I said, the choice between b) and c) rests with the anticipated evolution of the data model in the future. Which is more "scalable" depends on how you define that term. For example b) may be more scalable performance-wise by avoiding JOINs, then again that has to be balanced against the "scalability" for evolving the model. – Branko Dimitrijevic Feb 26 '14 at 17:27
  • Throughout my career, I have never used option a to the best of my knowledge before which is why I brought it up. I was initially trying it to see if I would incur any performance increases thus why I asked it here as well to get insight. Also, data can get corrupted regardless of design. That is up to good application design to make sure data corruption doesn't take place within the application. I do thank you for all the information and insight. – user3241191 Feb 26 '14 at 17:33
  • _"good application design to make sure data corruption doesn't take place"_ - That's jut one layer in what is supposed to be "layered defense" against data corruption. The database itself is another, and arguably more potent, when it can be used. One of the dangers of breaking the 1NF is in disabling the database's ability to protect itself from bad data. – Branko Dimitrijevic Feb 26 '14 at 18:01
1

The key here is that you are doing calculations against this data.

There are a variety of reasons why you should not overload a column by storing a CSV of multiple data points in it, but the controlling reason in this case is performance.

If you are basing calculations in the database on the values of these data points, you will absolutely want to utilize separate columns for them. If you store them as a single CSV string, then every calculation will incur the cost of unpacking the CSV and converting the characters to numeric values. The first operation is particularly expensive, in every DBMS I have ever used. It's the sort of thing that makes DBAs curse and throw things at the wall.

If, on the other hand, the data points are already stored as numeric values in independent columns, those overhead costs are not incurred and you can simply do basic math to calculate a numeric result from them.

Take it from a DBA: The cost of storing separate columns (or even a separate CustomerChoices table) is not zero, but it is still far less than the cost incurred by dealing with unpacking overloaded columns and doing data type conversions.

(In addition, when you store CSV data you are storing unnecessary bytes that are not even data: all of the commas. That too has a cost.)

The two prior answers may have invoked query performance in support of the CSV solution, but they have overlooked the key point that the data must be employed in numerical calculations.

If you could do the number crunching in your application and store only the computed percentage and the choices to your DB store, then you could maybe get away with storing the choices as a CSV. But only when a DBA isn't looking.

Even in that case, any DBA worth their salt will argue for separate numeric columns anyway, because at some time in the future there will be a request to the DBA to reproduce the application's calculation in the database for an auditing query, and they will be out of things to throw at the wall because they already smashed all their favorite things.

0

In relation to query performance I would suggest you build a column and have all the selected values comma separated within that column

Spaced
  • 55
  • 5