0

OK, I know the technical answer is NEVER.

BUT, there are times when it seems to make things SO much easier with less code and seemingly few downsides, so please here me out.

I need to build a Table called Restrictions to keep track of what type of users people want to be contacted by and that will contain the following 3 columns (for the sake of simplicity):

minAge
lookingFor
drugs

lookingFor and drugs can contain multiple values.

Database theory tells me I should use a join table to keep track of the multiple values a user might have selected for either of those columns.

But it seems that using comma-separated values makes things so much easier to implement and execute. Here's an example:

Let's say User 1 has the following Restrictions:

minAge => 18
lookingFor => 'Hang Out','Friendship'
drugs => 'Marijuana','Acid'

Now let's say User 2 wants to contact User 1. Well, first we need to see if he fits User 1's Restrictions, but that's easy enough EVEN WITH the comma-separated columns, as such:

First I'd get the Target's (User 1) Restrictions:

SELECT * FROM Restrictions WHERE UserID = 1

Now I just put those into respective variables as-is into PHP:

$targetMinAge = $row['minAge'];
$targetLookingFor = $row['lookingFor'];
$targetDrugs = $row['drugs'];

Now we just check if the SENDER (User 2) fits that simple Criteria:

COUNT (*) 
   FROM Users
WHERE 
   Users.UserID = 2 AND
   Users.minAge >= $targetMinAge AND
   Users.lookingFor IN ($targetLookingFor) AND
   Users.drugs IN ($targetDrugs)

Finally, if COUNT == 1, User 2 can contact User 1, else they cannot.

How simple was THAT? It just seems really easy and straightforward, so what is the REAL problem with doing it this way as long as I sanitize all inputs to the DB every time a user updates their contact restrictions? Being able to use MySQL's IN function and already storing the multiple values in a format it will understand (e.g. comma-separated values) seems to make things so much easier than having to create join tables for every multiple-choice column. And I gave a simplified example, but what if there are 10 multiple choice columns? Then things start getting messy with so many join tables, whereas the CSV method stays simple.

So, in this case, is it really THAT bad if I use comma-separated values?

****ducks****

Community
  • 1
  • 1
ProgrammerGirl
  • 3,157
  • 7
  • 45
  • 82
  • It **seems** seems really easy and straightforward and indeed it may be. The other solution (with many JOINs) may look difficult and not straightforward - and indeed it may be difficult and not straightforward. But it has all the other advantages the link (you already have read) describes, like speed, integrity enforcement, etc. – ypercubeᵀᴹ Sep 22 '11 at 17:02
  • 1
    My advice is: Read that link again. – ypercubeᵀᴹ Sep 22 '11 at 17:02
  • 3
    basic rule of thumb: if you EVER need to access ANY of the individual parts of a comma-separated-list, then do NOT use a comma-separated-list. Doing so destroys the point of using relational databases in the first place. You've rendered the data non-relational, so now the DB's reduced to being a very large/expensive dumb storage system. – Marc B Sep 22 '11 at 17:03
  • And I really do not understand how the `Users.lookingFor IN ($targetLookingFor)` will match userA's `'Hang Out,Friendship'` with userB's `'Friendship,Football'` – ypercubeᵀᴹ Sep 22 '11 at 17:06
  • @ypercube: Most of the stuff in that link didn't seem to apply to this simple case. Speed isn't an issue as I've tested the CSV method and it is very fast. – ProgrammerGirl Sep 22 '11 at 17:07
  • @Marc B: I will not only NEVER have to access any individual part of the CSV lists, but they will ALWAYS be ONE-WAY accesses where I compare a SINGLE value against the CSV list using MySQL's IN function. So is this method really THAT bad in this particular case considering how much easier it seems to make things? – ProgrammerGirl Sep 22 '11 at 17:07
  • @ypercube: `Users.lookingFor`would be a SINGLE value (e.g. what that person is looking for, like 'Friendship'), and `$targetLookingFor` would contain the CSV list that MySQL can naturally interpret via the IN function. Hope that makes sense. – ProgrammerGirl Sep 22 '11 at 17:08
  • It makes it easier for you, doesn't make it easier for the DB. By embedding the CSV list into the query like that, you completely gut the ability to use indexes. If it was a proper sub-table, you'd do a simple join and the db could use indexes. Now it has to scan that list for EVERY row to see if the row's value is in the list. – Marc B Sep 22 '11 at 17:09
  • @Marc B: Why would I need an index on the CSV values? I would just need to index the `UserID` column of the `Restrictions` table to get all the Target's restrictions, and then simply do an `IN` check on the CSV columns like in my example. Seems simple enough, no? – ProgrammerGirl Sep 22 '11 at 17:12
  • @Programmer: You seem to have your mind set to actually do things your way. Try it. With a few (lets say a million) users. – ypercubeᵀᴹ Sep 22 '11 at 17:14
  • @ypercube: Not set on it at all, I just want concrete reasons why I shouldn't that takes into account my example. That's all. My way just seems so uncomplicated vs. doing a new join table for each Multiple-Value criteria, that I really want to understand the tangible benefits of doing it the "long" way and how that will benefit what I'm trying to do specifically. – ProgrammerGirl Sep 22 '11 at 17:17
  • @Programmer: You need an index because if you're searching by that field, an index keeps MySQL from having to read in each row of the table, split the column into discrete values, and check the search string against each of those values. Only, MySQL's going to have to do that anyway with CSVs, because it indexes strings starting from the beginning of the whole thing -- not the beginning of each word. Once you have a bunch of users, queries like that can get slow as hell. – cHao Sep 22 '11 at 17:22
  • @cHao: But I don't need to index the CSV columns, I would only need to index the UserID column of the "Restrictions" table to quickly pull all the Restrictions (including the CSV ones) from the target user as in my example. – ProgrammerGirl Sep 22 '11 at 17:23
  • So guys, with all the clarifications I gave above, what are the tangible reasons NOT to go the CSV route? The only one I've heard so far is it will cause a headache if I switch from PHP to something else, but that's not going to happen in my case. So, what are the other tangible drawbacks? – ProgrammerGirl Sep 22 '11 at 17:38
  • @Programmer: I'd rather let the database do that and minimize the amount of PHP code i have to write and test. Oh, wait, now i can't reasonably do that for reasons mentioned above and below. Thanks a lot. :/ – cHao Sep 22 '11 at 17:41
  • @cHao: I understand that ideally the DB should do everything, but the amount of time it would take me to implement it vs. my solution is something I don't want to have to do if I don't absolutely have to (as I have many other things to prioritize, and time is money in the "real world"), and so far I'm not convinced that my method is all that bad considering how it will be used in my very specific case. I'm still waiting for a compelling reason. – ProgrammerGirl Sep 22 '11 at 17:45
  • 2
    @Programmer: You have a choice, of course -- you could skip doing stuff correct in order to get stuff done, and for now it will work. But you're going to be putting in the time later, if the site's worth a damn, because users are going to want more stuff. Like, say, recommended contacts. And you're going to have to code some way around the limitations of your database in order to implement it. By the end of it all, you'll have written a half-assed database engine in PHP and just be using MySQL as a relatively expensive raw data store. Trust me, i've been there. It's no fun. – cHao Sep 22 '11 at 18:05
  • CSV is sort of a native data format in SQL. You could rather use [`FIND_IN_SET()`](http://www.google.com/search?q=FIND_IN_SET) to utilize bound params there. -- This is oftentimes practical for storage-only attributes. If the performance is acceptable if you query these fields a lot is another question. (You'll have to test that. The excessive database normalization advisal is not very practical by itself.) – mario Sep 22 '11 at 18:17

2 Answers2

7

You already know the answer.

First off, your PHP code isn't even close to working because it only works if user 2 has only a single value in LookingFor or Drugs. If either of these columns contains multiple comma-separated values then IN won't work even if those values are in the exact same order as User 1's values. What do expect IN to do if the right-hand side has one or more commas?

Therefore, it's not "easy" to do what you want in PHP. It's actually quite a pain and would involve splitting user 2's fields into single values, writing dynamic SQL with many ORs to do the comparison, and then doing an extremely inefficient query to get the results.

Furthermore, the fact that you even need to write PHP code to answer such a relatively simple question about the intersection of two sets means that your design is badly flawed. This is exactly the kind of problem (relational algebra) that SQL exists to solve. A correct design allows you to solve the problem in the database and then simply implement a presentation layer on top in PHP or some other technology.

Do it correctly and you'll have a much easier time.

Larry Lustig
  • 49,320
  • 14
  • 110
  • 160
  • Let's say I will always use PHP, is this still that big of an issue when it's just a few lines of code while making things so much easier on the development front? After all, those columns will only ever be used for one-way checking (checking SINGLE values against MULTIPLE values using MySQL's IN function). – ProgrammerGirl Sep 22 '11 at 17:05
  • 2
    That's not true even with the example you gave, unless User 1 is never allowed to attempt to contact anyone. That record *already* has multiple values in both columns – Larry Lustig Sep 22 '11 at 17:11
  • I don't think I expressed myself clearly, every User has their SINGLE "lookingFor" and "drugs" column set on the main "Users" table. So if User 1 were to contact someone, I would simply check User 1's columns in the main User table (which contains only SINGLE values), against the TARGET'S Multiple-Value columns in the "Restrictions" table, which can be easily accomplished using MySQL's IN function. – ProgrammerGirl Sep 22 '11 at 17:21
  • 2
    If you are willing to limit yourself to the single requirement you've presented, you'll be able to solve your problem in a reasonably performant fashion (no more performant that the correct way, but you won't take a big hit). But you will not be able to answer any of the other questions you will eventually want to look at (eg, "suggest appropriate contacts for a user"). Making a poor design decision because it will get you past version 0.1 of your application will cost you a lot of time and money to fix later. – Larry Lustig Sep 22 '11 at 17:47
1

Suppose User 1 is looking for 'Hang Out','Friendship' and User 2 is looking for 'Friendship','Hang Out'

Your code would not match them up, because 'Friendship','Hang Out' is not in ('Hang Out','Friendship')

That's the real problem here.

Niet the Dark Absol
  • 320,036
  • 81
  • 464
  • 592
  • I don't really see that as a problem because I'm never checking multiple values against multiple values, I'm ALWAYS checking a SINGLE value against MULTIPLE values, which can be easily done with MySQL's IN function. Any other major drawbacks I should consider? – ProgrammerGirl Sep 22 '11 at 17:03
  • Thing is, you can't say that. You said yourself, the `lookingFor` column could contain multiple values. If your variable is a single value, that means the `IN` is inverted. And if you flip it around to `$targetLookingFor IN Users.lookingFor`, then you've just forced a table scan (read: killed performance), because you can't use indexes anymore. – cHao Sep 22 '11 at 17:13
  • @cHao: I don't think I expressed myself clearly, every User has their SINGLE "lookingFor" and "drugs" column set on the main "Users" table. So if a User were to contact someone, I would simply check User 1's columns in the main User table (which contains only SINGLE values), against the TARGET'S Multiple-Value columns in the "Restrictions" table, which can be easily accomplished using MySQL's IN function. – ProgrammerGirl Sep 22 '11 at 17:22
  • So, wait...you're reading in User 1's restrictions, and then doing a query to see if he matches against User 2? Even though User 2 is the one who's logged in, and thus whose info you already have? Oy...this whole thing sounds backwards to me. – cHao Sep 22 '11 at 17:31
  • @cHao: I need to read the Targets restrictions one way or another, that doesn't change. What I'm trying to find out is if storing the multiple-value restrictions as CSV columns in a single Restrictions DB is really THAT bad considering that the queries will always be ONE-WAY checking if a SINGLE value is in a CSV...I still haven't gotten a straight answer because everyone thought I was going to be doing CSV against CSV checks, which is NOT the case. – ProgrammerGirl Sep 22 '11 at 17:35
  • @Programmer: Actually, the amount of data you have to read is nearly nil if the table's decently normalized. `SELECT COUNT(*) FROM RestrictLookingFor WHERE UserID = 1 AND lookingFor = 'Hanging Out';` returns one single int that tells you whether the user wants to be contacted. A join increases the complexity marginally, and still gives you damn good performance -- one decently indexed query is almost always considerably better than two. – cHao Sep 22 '11 at 17:47