414

Imagine a web form with a set of checkboxes (any or all of them can be selected). I chose to save them as a comma-separated list of values stored in one column of the database table.

Now, I know that the correct solution would be to create a second table and properly normalize the database. It was quicker to implement the easy solution, and I wanted a proof of concept of the application quickly, without having to spend too much time on it.

I thought the saved time and simpler code were worth it in my situation. Is this a defensible design choice, or should I have normalized the data from the start?

Some more context: this is a small internal application that essentially replaces an Excel file that was stored in a shared folder. I'm also asking because I'm thinking about cleaning up the program and making it more maintainable. There are some things in there I'm not entirely happy with, and one of them is the topic of this question.

Mad Scientist
  • 32
    In that case, why bother with a database? Saving to a file will do. – thavan Feb 12 '13 at 09:31
  • 8
    Agreed with @thavan. Why even save the data for a proof of concept? Once you have the proof complete, then add a database correctly. You're fine doing lightweight for a proof of concept, just don't make things you have to unmake later. – Jeff Davis Aug 09 '13 at 14:10
  • 2
    In Postgres, an array column should be preferred over a comma-separated list. That at least ensures the proper data type, has no problems with distinguishing the delimiter from the actual data, and it can be efficiently indexed. – a_horse_with_no_name Aug 30 '19 at 06:32
  • @a_horse_with_no_name - what would be the case if the column with the text content is GIN indexed (which is a one time operation)? thanks – fnisi Nov 21 '22 at 04:48

10 Answers

648

In addition to violating First Normal Form because of the repeating group of values stored in a single column, comma-separated lists have a lot of other more practical problems:

  • Can’t ensure that each value is the right data type: no way to prevent 1,2,3,banana,5
  • Can’t use foreign key constraints to link values to a lookup table; no way to enforce referential integrity.
  • Can’t enforce uniqueness: no way to prevent 1,2,3,3,3,5
  • Can’t delete a value from the list without fetching the whole list.
  • Can't store a list longer than what fits in the string column.
  • Hard to search for all entities with a given value in the list; you have to use an inefficient table-scan. May have to resort to regular expressions, for example in MySQL:
    `idlist REGEXP '[[:<:]]2[[:>:]]'` or in MySQL 8.0: `idlist REGEXP '\\b2\\b'`
  • Hard to count elements in the list, or do other aggregate queries.
  • Hard to join the values to the lookup table they reference.
  • Hard to fetch the list in sorted order.
  • Hard to choose a separator that is guaranteed not to appear in the values

To solve these problems, you have to write tons of application code, reinventing functionality that the RDBMS already provides much more efficiently.
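For illustration, a minimal sketch of the normalized alternative might look like the following; the table and column names (checkbox_option, form_response, response_option) are hypothetical, not taken from the question:

```sql
-- Lookup table for the available checkbox options (hypothetical names).
CREATE TABLE checkbox_option (
  option_id INT PRIMARY KEY,
  label     VARCHAR(50) NOT NULL UNIQUE
);

-- One row per submitted form.
CREATE TABLE form_response (
  response_id INT PRIMARY KEY,
  created_at  DATETIME NOT NULL
);

-- Intersection table: one row per checked box, instead of a CSV column.
CREATE TABLE response_option (
  response_id INT NOT NULL,
  option_id   INT NOT NULL,
  PRIMARY KEY (response_id, option_id),                              -- enforces uniqueness
  FOREIGN KEY (response_id) REFERENCES form_response (response_id),
  FOREIGN KEY (option_id)   REFERENCES checkbox_option (option_id)   -- referential integrity
);

-- "Which responses checked option 2?" becomes an indexed lookup, not a substring scan:
SELECT response_id FROM response_option WHERE option_id = 2;
```

Each bullet above maps to something this schema gets for free: the INT type, the foreign keys, and the composite primary key do the validation that the CSV column cannot.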

Comma-separated lists are wrong enough that I made this the first chapter in my book: SQL Antipatterns, Volume 1: Avoiding the Pitfalls of Database Programming.

There are times when you need to employ denormalization, but as @OMG Ponies mentions, these are exception cases. Any non-relational “optimization” benefits one type of query at the expense of other uses of the data, so be sure you know which of your queries need to be treated so specially that they deserve denormalization.

Bill Karwin
  • 11
    An ARRAY (of any datatype) can fix the exception, just check PostgreSQL: http://www.postgresql.org/docs/current/static/arrays.html (@Bill: Great book, a must read for any developer or dba) – Frank Heikens Nov 24 '11 at 21:18
  • 1
    For PostgreSQL-specific discussion see http://dba.stackexchange.com/q/55871/7788 . Comma-separated is just as awful, but an array field can be an acceptable performance optimisation under some circumstances if applied carefully and with consideration of the consequences. – Craig Ringer Dec 31 '13 at 08:54
  • 2
    @CraigRinger, yes, it's a type of denormalization. When used carefully, denormalization can be just the right thing to do for a certain query you are trying to optimize, but it must be done with full understanding that it harms other queries. If those other queries aren't important to your application, then the pain is less. – Bill Karwin Sep 25 '14 at 05:32
  • 1
    @Jay, thanks, that just points out another weakness: that comma-separated lists may not be sorted properly. – Bill Karwin Feb 20 '15 at 05:42
  • 3
    I know it's not recommended, but playing devil's advocate: most of these can be taken off if there is a UI that handles uniqueness and data types (otherwise it would error or misbehave), the UI drops and creates it anyway, there is a driver table where the values come from to make them unique, field like '%P%' can be used, values being P, R, S, T, counting doesn't matter, and sorting doesn't matter. Depending on the UI, values can be split[] e.g. to check checkboxes in a list from the driver table in the least common scenario without having to go to another table to get them. – jmcclure Sep 23 '15 at 01:09
  • 1
    MySQL makes it pretty easy to search using `FIND_IN_SET()`. – shmosel Jan 11 '17 at 08:37
  • 2
    @shmosel, yes, that's easy, but it cannot be optimized. Searching for a substring cannot use an index. – Bill Karwin Jan 11 '17 at 16:18
  • @BillKarwin did you ever think about performance? I have a list of 174 values for a feature, and if I have to store each one in the DB with your approach it will need 174 columns, but if we use a comma-separated value store it's solved quickly and also reduces load on the DB engine. I am just asking as devil's advocate, don't think otherwise. – Prabhu Nandan Kumar Feb 28 '18 at 12:00
  • 6
    @PrabhuNandanKumar, I would store 174 *rows* in a second table that references your first table. Do not store 174 columns with similar data. – Bill Karwin Feb 28 '18 at 16:33
  • What about the JSON Data Type then...? – SuN Mar 01 '19 at 11:47
  • 1
    @SuN, you might be interested in my presentation [How to Use JSON in MySQL Wrong](https://www.slideshare.net/billkarwin/how-to-use-json-in-mysql-wrong). – Bill Karwin Mar 01 '19 at 13:56
  • Postgres json datatype fixes those problems. Go for it. Vive la liberté! – Bohemian Oct 07 '20 at 21:56
  • The 1st and 3rd bullet points are not necessarily true, since those *can* (and should) also be handled by the front-end interface (data validation). So I'd suggest they read "can be difficult to" rather than "can't". – TylerH Dec 07 '20 at 16:42
  • 1
    @TylerH How can you be sure that _all_ the data inserted into the table was formatted correctly? It could have been inserted by some other client, not the front-end that you designed. Perhaps even the data you insert today will be updated later by some other client, and they will make it malformed. – Bill Karwin Dec 07 '20 at 16:59
  • @BillKarwin You check the frontend that has a connection to your database, which, as I said, should be validating/sanitizing the inputs on its end. I don't disagree with you, but I think there's an important distinction in the real world between "can't" and "difficult to", especially when deadlines are more important than best practice compliance :-) Not everyone creating databases (or reading this Q&A series) has the time/knowledge/aptitude to do it the best way... they may be in a crunch, not have time to learn before providing a product, etc. Just want them to have the most accurate info. – TylerH Dec 07 '20 at 17:02
  • 2
    I agree there's an important distinction between "can't" and "difficult to." For example, in this case you _can't_ enforce that the data remain formed correctly, because any client can update them to be malformed. – Bill Karwin Dec 07 '20 at 17:22
  • 4
    Another way of viewing this: the challenge is not to design a database that allows one well-behaved client to insert data formed correctly. The challenge is to ensure that all data inserted by any client is always formed correctly. – Bill Karwin Dec 07 '20 at 17:28
  • 1
    I wish I could down vote comments that demonstrate the Dunning Kruger effect. Instead, I can keep up voting Bill Karwin. *(He's an actual expert, stop asserting that you're right and the expert is wrong, start asking what you are missing that the expert has accounted for. Only after a LOT of skeptical searching for your own mistake, including asking Bill Karwin, should you even Begin to consider asserting that you're right.)* – MatBailie Dec 26 '22 at 22:47
49

"One reason was laziness".

This rings alarm bells. The only reason you should do something like this is that you know how to do it "the right way" but you have come to the conclusion that there is a tangible reason not to do it that way.

Having said this: if the data you are choosing to store this way is data that you will never need to query by, then there may be a case for storing it in the way you have chosen.

(Some users would dispute the statement in my previous paragraph, saying that "you can never know what requirements will be added in the future". These users are either misguided or stating a religious conviction. Sometimes it is advantageous to work to the requirements you have before you.)

Hammerite
  • 3
    I always hear some people saying that "my design is more flexible than yours" when I confront them about things like not setting up foreign key constraints, or storing lists in a single field. To me, flexibility (in such cases) == no discipline == laziness. – foresightyj Jan 30 '15 at 03:28
45

There are numerous questions on SO asking:

  • how to get a count of specific values from the comma-separated list
  • how to get records that have only the same 2/3/etc. specific values from that comma-separated list

Another problem with the comma-separated list is ensuring the values are consistent: storing text means the possibility of typos...

These are all symptoms of denormalized data, and highlight why you should always model for normalized data. Denormalization can be a query optimization, to be applied when the need actually presents itself.
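As a rough sketch, assuming a normalized intersection table like the hypothetical response_option used earlier in this thread's examples, those recurring questions collapse into ordinary aggregates:

```sql
-- Count how many responses selected each option.
SELECT option_id, COUNT(*) AS times_selected
FROM response_option
GROUP BY option_id;

-- Responses that selected options 1, 2 and 3 and nothing else.
SELECT response_id
FROM response_option
GROUP BY response_id
HAVING COUNT(*) = 3
   AND COUNT(CASE WHEN option_id IN (1, 2, 3) THEN 1 END) = 3;
```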

OMG Ponies
19

In general anything can be defensible if it meets the requirements of your project. This doesn't mean that people will agree with or want to defend your decision...

In general, storing data in this way is suboptimal (e.g. harder to do efficient queries) and may cause maintenance issues if you modify the items in your form. Perhaps you could have found a middle ground and used an integer representing a set of bit flags instead?
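If anyone wants to explore that middle ground, here is a minimal MySQL-flavoured sketch; the flag values and column names are invented for illustration, and note that a bitwise predicate still cannot use an ordinary index:

```sql
-- One bit per checkbox (hypothetical mapping): 1 = newsletter, 2 = terms, 4 = promo, 8 = beta.
CREATE TABLE response_flags (
  response_id INT PRIMARY KEY,
  options     INT UNSIGNED NOT NULL DEFAULT 0
);

-- Store "newsletter + promo" as 1 | 4 = 5.
INSERT INTO response_flags (response_id, options) VALUES (1, 1 | 4);

-- Every response where the "promo" box (bit value 4) was ticked.
SELECT response_id FROM response_flags WHERE options & 4 = 4;
```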

bobbymcr
  • Performance-wise, that would scale better than the intermediate table. You're looping over N and not over N×M (if M is the average number of categories per item). – Vincent Apr 28 '23 at 06:16
11

Yes, I would say that it really is that bad. It's a defensible choice, but that doesn't make it correct or good.

It breaks first normal form.

A second criticism is that putting raw input results directly into a database, without any validation or binding at all, leaves you open to SQL injection attacks.

What you're calling laziness and lack of SQL knowledge is the stuff that neophytes are made of. I'd recommend taking the time to do it properly and view it as an opportunity to learn.

Or leave it as it is and learn the painful lesson of a SQL injection attack.
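For completeness, a minimal sketch of parameterised input, shown here with MySQL's server-side prepared statements (the table and columns are hypothetical); in practice the binding usually happens in the application layer, as the OP notes they do with Drupal's db_query:

```sql
-- The user-supplied text is bound as a parameter and never concatenated into the SQL string.
PREPARE ins FROM 'INSERT INTO submission (submission_id, free_text) VALUES (?, ?)';
SET @id = 1, @txt = 'anything the user typed; DROP TABLE submission; --';
EXECUTE ins USING @id, @txt;
DEALLOCATE PREPARE ins;
```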

duffymo
  • 22
    I don't see anything in this question that suggests he is vulnerable to SQL injection. SQL injection and database normalisation are orthogonal topics, and your digression on injection is irrelevant to the question. – Hammerite Sep 06 '10 at 18:24
  • The input is escaped, and anybody who has access to this application already has easier ways of wreaking havoc. I'm using Drupal db_query to access the database, separately supplying the parameters. – Mad Scientist Sep 06 '10 at 18:25
  • @Hammerite, even if this particular bit of laziness and unwillingness to learn doesn't lead to an SQL injection, other examples of the same attitude will. – Paul Tomblin Sep 06 '10 at 18:26
  • @Hammerite, there's nothing to exclude the possibility, either. I thought it was worth bringing up in case the OP's ignorance extended to SQL injection as well. I agree that normalization and SQL injection can be orthogonal, but without other information it seemed to me that it should be mentioned. It's hardly irrelevant. – duffymo Sep 06 '10 at 18:28
  • 5
    @Paul: And maybe the same attitude will lead to him being hit by a bus when he fails to look both ways before crossing the street, but you haven't warned him about that. Edit: I had thought you were the poster of this answer, my mistake. – Hammerite Sep 06 '10 at 18:29
8

I needed a multi-value column; it could be implemented as an XML field.

It could be converted to a comma-delimited list as necessary; see querying an XML list in SQL Server using XQuery.

As an XML field, some of the concerns can be addressed.

With CSV: Can't ensure that each value is the right data type: no way to prevent 1,2,3,banana,5

With XML: values in a tag can be forced to be the correct type


With CSV: Can't use foreign key constraints to link values to a lookup table; no way to enforce referential integrity.

With XML: still an issue


With CSV: Can't enforce uniqueness: no way to prevent 1,2,3,3,3,5

With XML: still an issue


With CSV: Can't delete a value from the list without fetching the whole list.

With XML: single items can be removed


With CSV: Hard to search for all entities with a given value in the list; you have to use an inefficient table-scan.

With XML: xml field can be indexed


With CSV: Hard to count elements in the list, or do other aggregate queries.

With XML: not particularly hard


With CSV: Hard to join the values to the lookup table they reference.

With XML: not particularly hard


With CSV: Hard to fetch the list in sorted order.

With XML: not particularly hard


With CSV: Storing integers as strings takes about twice as much space as storing binary integers.

With XML: storage is even worse than a csv


With CSV: Plus a lot of comma characters.

With XML: tags are used instead of commas


In short, using XML gets around some of the issues with delimited lists AND can be converted to a delimited list as needed.
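As a rough SQL Server sketch of what this answer describes (the table and element names are made up for illustration), an XML column can be shredded and searched with the built-in XML methods:

```sql
-- Hypothetical table storing the selections as XML instead of a CSV string.
CREATE TABLE response_xml (
  response_id INT PRIMARY KEY,
  options     XML NOT NULL
);

INSERT INTO response_xml (response_id, options)
VALUES (1, '<options><id>1</id><id>2</id><id>5</id></options>');

-- Shred the XML back into one row per selected value.
SELECT r.response_id,
       o.x.value('.', 'int') AS option_id
FROM response_xml AS r
CROSS APPLY r.options.nodes('/options/id') AS o(x);

-- Responses that selected option 2 (an XML index can support this).
SELECT response_id
FROM response_xml
WHERE options.exist('/options/id[. = "2"]') = 1;
```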

James A Mohler
6

Yes, it is that bad. My view is that if you don't like using relational databases, then look for an alternative that suits you better; there are lots of interesting "NoSQL" projects out there with some really advanced features.

Robin
4

Well, I've been using a key/value pair tab-separated list in an NTEXT column in SQL Server for more than 4 years now and it works. You do lose the flexibility of making queries, but on the other hand, if you have a library that persists/de-persists the key/value pairs, then it's not that bad an idea.

Raj
  • 16
    No, it's a horrible idea. You've managed to get away with it, but your few minutes of saved development time have cost you lousy query performance, flexibility, and maintainability of your code. – Paul Tomblin Sep 06 '10 at 18:28
  • 5
    Paul, I agree. But as I said, I used it for a specific purpose, and that is for a data entry operation where you have many kinds of forms. I am revising the design now that I have learnt NHibernate, but back then I needed the flexibility to design the form in ASP.NET and use the textbox ids as keys in the key/value pairs. – Raj Sep 07 '10 at 06:58
  • 35
    Telling someone who has maintained the app for 4 years about maintenance concerns is a bit presumptuous. There are very few "horrible" ideas in software development - mostly they're just ideas with very limited applicability. It's reasonable to warn folks of the limitations, but chastising those who have done it and lived through it strikes me as a holier-than-thou attitude I can do without. – Mark Brackett Jul 09 '13 at 22:55
0

I would probably take the middle ground: make each field in the CSV into a separate column in the database, but not worry much about normalization (at least for now). At some point, normalization might become interesting, but with all the data shoved into a single column you're gaining virtually no benefit from using a database at all. You need to separate the data into logical fields/columns/whatever you want to call them before you can manipulate it meaningfully at all.
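A minimal sketch of that middle ground, with one typed column per checkbox (the column names are invented for illustration):

```sql
-- One column per checkbox instead of a single CSV column (hypothetical names).
CREATE TABLE response_wide (
  response_id      INT PRIMARY KEY,
  wants_newsletter BOOLEAN NOT NULL DEFAULT FALSE,
  accepted_terms   BOOLEAN NOT NULL DEFAULT FALSE,
  wants_promo_mail BOOLEAN NOT NULL DEFAULT FALSE
);

-- Each flag is now typed, defaulted and directly queryable.
SELECT COUNT(*) FROM response_wide WHERE wants_promo_mail;
```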

Jerry Coffin
-1

If you have a fixed number of boolean fields, you could use an INT(1) NOT NULL (or BIT NOT NULL if it exists) or CHAR(0) (nullable) for each. You could also use a SET (I forget the exact syntax).
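Since the answer mentions forgetting the SET syntax, here is a rough MySQL sketch (the option names are hypothetical); note that SET is MySQL-specific, limited to 64 members, and FIND_IN_SET still cannot use an ordinary index:

```sql
-- MySQL SET column: each listed value can appear at most once per row.
CREATE TABLE response_set (
  response_id INT PRIMARY KEY,
  options     SET('newsletter', 'terms', 'promo', 'beta') NOT NULL DEFAULT ''
);

INSERT INTO response_set (response_id, options) VALUES (1, 'newsletter,promo');

-- Rows where 'promo' is among the selected options.
SELECT response_id FROM response_set WHERE FIND_IN_SET('promo', options) > 0;
```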

Solomon Ucko
  • 5
    `INT(1)` takes 4 bytes; the `(1)` is meaningless. – Rick James May 23 '20 at 23:54
  • 1
    How many bytes an INT(1) takes is product specific, and also what INT(1) means. Can be one digit, one byte, one word, or perhaps something else? – jarlh Sep 25 '21 at 19:09