37

I think I found the answer to my question, I'm just unsure of the syntax, I keep getting SQL errors.

Basically, I want to do the opposite of IN. Take this example:

SELECT * 
  FROM users INNER JOIN 
       tags ON tags.user_id = users.id 
 WHERE tags.name IN ('tag1', 'tag2');

The above will return ANY users that have 'tag1' OR 'tag2'. I want users that have BOTH. They must have both tags to be returned. I'm assuming the keyword ALL should be used, but can't get it to work.

Thanks for your help.

Peter Brown
  • 50,956
  • 18
  • 113
  • 146
Binary Logic
  • 2,562
  • 7
  • 31
  • 39

11 Answers11

41

Let's talk about this problem in generalities first, then specifics.

In this problem what you want to do is select rows from table A depending on conditions in two (or for the general case, more than two) rows in table B. In order to accomplish this, you need to do one of two things:

  1. execute tests against different rows in table B

  2. aggregate the rows of interest from table B into a single row which somehow contains the information you need to test the original rows from table B

This kind of problem is the big reason why, I think, you see people creating comma-delimited lists in VARCHAR fields instead of normalizing their databases correctly.

In your example, you want to select user rows based on the existence of rows matching two specific conditions in tags.

(1) Testing different rows.

There are three ways you can use technique (1) (testing different rows). They are using EXISTS, using sub-queries, and using JOINs:

1A. Using EXISTs is (in my opinion, anyway) clear because it matches what you're trying to do — checking for existence of rows. This is moderately scalable to more tags in terms of writing the SQL creation if you're generating dynamic SQL, you simple add an additional AND EXISTS clause for each tag (performance, of course, will suffer):

SELECT * FROM users WHERE 
  EXISTS (SELECT * FROM tags WHERE user_id = users.id AND name ='tag1') AND
  EXISTS (SELECT * FROM tags WHERE user_id = users.id AND name ='tag2')

I think this clearly expresses the intent of the query.

1B Using sub-queries is also pretty clear. Because this technique does not involve correlated sub-queries some engines can optimize it better (it depends, in part, on the number of users with any given tag):

SELECT * FROM users WHERE 
  id IN (SELECT user_id FROM tags WHERE name ='tag1') AND
  id IN (SELECT user_id FROM tags WHERE name ='tag2') 

This scales the same way that option 1A does. It's also (to me, anyway) pretty clear.

1C Using JOINs involves INNER JOINing the tags table to the users table once for each tag. It doesn't scale as well because it's harder (but still possible) to generate the dynamic SQL:

SELECT u.* FROM users u 
     INNER JOIN tags t1 ON u.id = t1.user_id
     INNER JOIN tags t2 ON u.id = t2.user_id
  WHERE t1.name = 'tag1' AND t2.name = 'tag2'

Personally, I feel this is considerably less clear than the other two options since it looks like the goal is to create a JOINed record set rather than filter the users table. Also, scalability suffers because you need to add INNER JOINs and change the WHERE clause. Note that this technique sort of straddles techniques 1 and 2 because it uses the JOINs to aggregate two rows from tags.

(2) Aggregating rows.

There are a two main ways of doing this, using COUNTs and using string processing:

2A Using COUNTs is much easier if your tags table is "protected" against having the same tag applied twice to the same user. You can do this by making (user_id, name) the PRIMARY KEY in tags, or by creating a UNIQUE INDEX on those two columns. If the rows are protected in that way you can do this:

 SELECT users.id, users.user_name 
   FROM users INNER JOIN tags ON users.id = tags.user_id
   WHERE tags.name IN ('tag1', 'tag2')
   GROUP BY users.id, users.user_name
   HAVING COUNT(*) = 2

In this case you match the HAVING COUNT(*) = test value against the number of tags name in the IN clause. This does not work if each tag can be applied to a user more than once because the count of 2 could be produced by two instances of 'tag1' and none of 'tag2' (and the row would qualify where it shouldn't) or two instances of 'tag1' plus one instance of 'tag2' would create a count of 3 (and the user would not qualify even though they should).

Note that this is the most scalable technique performance-wise since you can add additional tags and no additional queries or JOINs are needed.

If multiple tags are allowed you can perform an inner aggregation to remove the duplicates. You can do this in the same query I showed above, but for simplicity sake I'm going to break the logic out into a separate view:

 CREATE VIEW tags_dedup (user_id, name) AS
 SELECT DISTINCT user_id, name FROM tags

and then you go back to the query above and substitute tags_dedup for tags.

2B Using String processing is database specific because there is no standard SQL aggregate function to produce string lists from multiple rows. Some databases, however, offer extensions to do this. In MySQL, you can use GROUP_CONCAT and FIND_IN_SET to do this:

SELECT user.id, users.user_name, GROUP_CONCAT(tags.name) as all_tags
  FROM users INNER JOIN tags ON users.id = tags.user_id
  GROUP BY users.id, users.user_name
  HAVING FIND_IN_SET('tag1', all_tags) > 0 AND
         FIND_IN_SET('tag2', all_tags) > 0 

Note, this is very inefficient and uses MySQL unique extensions.

Larry Lustig
  • 49,320
  • 14
  • 110
  • 160
  • ah. you've done what I've done, listed different solutions. With a better analysis of the problem though... – gbn Jan 22 '11 at 15:18
  • Yes, but you get _extra_ kudos for INTERSECT. This question gets asked a lot, I'm trying to write a definitive answer that I can link to in the future. – Larry Lustig Jan 22 '11 at 15:31
  • @LarryLustig what if you have a a joiner table where a tag has many posts through post_tags, for example. How would this work? I tried several examples here, but no luck: http://stackoverflow.com/questions/25606775/how-to-find-posts-tagged-with-more-than-one-tag-in-rails-and-postgresql/25608193#25608193 – Jumbalaya Wanton Sep 02 '14 at 12:39
24

You'll want to join the tags table again.

SELECT * FROM users
INNER JOIN tags as t1 on t1.user_id = users.id and t1.name='tag1'
INNER JOIN tags as t2 on t2.user_id = users.id and t2.name='tag2'
squillman
  • 13,363
  • 3
  • 41
  • 60
  • Yours and @bdukes are the only really usable answers IMO. Those `COUNT` and `HAVING` seem like overcoding to me... – rsenna Jan 21 '11 at 20:15
  • Thanks for your help. I agree, with multiple tags this could get messy, but according to the post rsenna linked to, joins are much faster than taking the aggregate route. But really creative solution. – Binary Logic Jan 21 '11 at 21:02
  • 1
    @bin See my comment against rsenna's further down. This is `not` better. – RichardTheKiwi Jan 22 '11 at 03:00
  • Probably `tags` should have `PRIMARY KEY(user_id, name)`. – Rick James Jan 25 '21 at 00:48
6

I would do exactly what you are doing first, because that gets a list of all users with 'tag1' and a list of all users with 'tag2', but in the same response obviously. So, we have to add some more:

Do a group by users (or users.id) and then having count(*) == 2. That will group duplicate users (which means those with both tag1 and tag2) and then the having-part will remove the ones with just one of the two tags.

This solution avoids adding yet another join-statement, but honestly I'm not sure which is faster. People, feel free to comment on the performance-part :)

EDIT: Just to make it easier to try out, here's the whole thing:

SELECT * 
FROM users INNER JOIN 
     tags ON tags.user_id = users.id 
WHERE tags.name = 'tag1' OR tags.name = 'tag2'
GROUP BY users.id
HAVING COUNT(*) = 2
Jakob
  • 24,154
  • 8
  • 46
  • 57
  • 2
    +1 because this scales well for higher number of tag names as well. – Péter Török Jan 21 '11 at 20:01
  • Sorry, but aggregates are notoriously slow. Unfortunately I have found just an example using MySQL [here](http://stackoverflow.com/questions/477006/sql-statement-join-vs-group-by-and-having/477013#477013), but it is almost exactly the same case, and the aggregate version is almost 20 times **slower** than the version with one more join. Still, your answer is a very smart one - I just would never use it. – rsenna Jan 21 '11 at 20:30
  • 1
    @rsenna - if you go further in your link to http://www.cforcoding.com/2009/07/oracle-vs-mysql-vs-sql-server.html and the `better` test, you will see that HAVING performs better in MySQL. Then again, performance ALWAYS depends on the distribution of the data. If you wrote the IN clause as multiple OR clauses, MySQL won't get confused and `HAVING` should come up faster consistently. – RichardTheKiwi Jan 21 '11 at 21:41
  • 1
    @rsenna - it only makes sense because it takes the optimizer ONE pass across the index to pick out 1,2 and 3, then it only has to aggregate ONCE. With your join, the index for 1 DOES NOT contain any information about 2,3, so it has to `go back` and retrieve the set for 2, then hash match - so I don't see how it `decreases` the set. Maybe after the hash/merge, but not `before`. – RichardTheKiwi Jan 21 '11 at 21:43
  • @Jakob. See my answer. @Peter. Er, it doesn't "scale" at all ! – PerformanceDBA Jan 21 '11 at 22:50
  • @cyberkiwi. That's a bit low level to worry about at this stage. First get it right (which has not been achieved yet), then improve the performance. The reverse sequence causes various problems. – PerformanceDBA Jan 21 '11 at 23:11
  • @PerformanceDBA: HAHAHAHA! "First get it right". So we should use your invalid solution then? :) Mine, squillman's and ryanprayogo's all work. It has been achieved! Your answer is not adding anything but confusion to this discussion. – Jakob Jan 22 '11 at 10:13
  • @Jakob. You are displaying your immaturity again on a technical website. Wipe your nose; drink some tea; read my comments; then read my code again. Slowly. Post questions, not statements. Then read again, slowly. – PerformanceDBA Jan 22 '11 at 12:29
  • @Performance: I'm sorry for the harsh response. Sometimes I get a little upset (always makes me think of http://xkcd.com/386). I'd prefer an nice and open discussion about the actual solution to this problem. I'll slow down. If you'd like to do something to improve the discusion as well, I'd urge you to try out out your examples before posting. – Jakob Jan 22 '11 at 12:33
  • @Jakob. You know, I empathise. Anyone who would give up a roll in the hay to prove themselves Right is a sad case. The only thing sadder is someone who is Wrong, who is trying to prove themselves Right. Thank God I can accept being wrong. That is why it is better to ask questions, than to make statements on a public website. If you have read my explanations, and still do not understand, if point out that bit (**I** do not know what it is!), I would be pleased to provide even more explanation. Why do you think Binary accepted my answer, if it doesn't work ?!?!? **Find out why**. – PerformanceDBA Jan 22 '11 at 13:06
  • I don't think he have verified that it returns the right results. He probably has no test case covering it so to speak. And you obviously haven't tested it either. Why do you think all people except you and the guy who asked the question (who obviously don't know the answer; he asked for it) are telling you that **this solution doesn't work**? – Jakob Jan 22 '11 at 13:13
  • 1) *who obviously don't know the answer; he asked for it* Yes, and because it is a straight, simple coding question, that is precisely why he *will* know if it is right. 2) This is not the first time on SO, that would-be coders do not understand the code of experienced coders. 3) Read my **amendment** 4) Your "test" does not show results ! – PerformanceDBA Jan 22 '11 at 15:16
4

Ok, stating the problem again.

"Find users that have entries in the tags table for both tag1 and tag2". This means at least 2 rows in the child tags table for each user table entry

Solution 1: The intersection of "users with tag1" and "users with tag2"

SELECT u.*
FROM 
    users u INNER JOIN 
    (
    SELECT user_id FROM tags WHERE name = 'tag1'
    INTERSECT
    SELECT user_id FROM tags WHERE name = 'tag2'
    ) t ON u.id = t.user_id

Solution 2: EXISTS

SELECT u.*
FROM 
    users u
WHERE 
    EXISTS (SELECT * FROM tags t1 WHERE t1.name = 'tag1'
                     AND u.id = t1.user_id)
    AND
    EXISTS (SELECT * FROM tags t2 WHERE t2.name = 'tag2'
                     AND u.id = t2.user_id)

Solution 3: JOIN

SELECT u.* FROM
   users u
   INNER JOIN
   tags as t1 on t1.user_id = u.id
   INNER JOIN
   tags as t2 on t2.user_id = u.id 
WHERE
   t1.name='tag1' AND t2.name='tag2'

Solution 4: IN

SELECT u.*
FROM 
    users u
WHERE 
    u.id (SELECT t1.user_id FROM tags t1 WHERE t1.name = 'tag1')
    AND
    u.id (SELECT t2.user_id FROM tags t2 WHERE t2.name = 'tag2')

All The EXISTS, INTERSECT and IN should give the same execution plan in SQL Server

Now, these are all for the case where you are looking for 2 tags. As you want more tags, they become cumbersome so use shahkalpesh's solution.

However, I'd modify it so the tags are in a table and no extra OR clauses are needed

SELECT u.*
FROM
    Users u
    Inner join
    tags t ON t.user_id = u.id
    JOIN
    @MyTags mt ON t.name = mt.name
GROUP BY u.*
HAVING count(tags.*) = COUNT(DISTINCT mt.name)
gbn
  • 422,506
  • 82
  • 585
  • 676
2
SELECT Users.id, count(tags.*) as tagCount
FROM Users Inner join tags
ON tags.user_id = users.id
WHERE tags.name='tag1' OR tags.name='tag2'
GROUP BY Users.id
HAVING count(tags.*) = 2
shahkalpesh
  • 33,172
  • 3
  • 63
  • 88
  • I marked this up as it is the closest to being correct. But it should really `COUNT(DISTINCT tags.name)`. This scales by OR-ing more and increasing the HAVING test from 2 to 3,4 etc – RichardTheKiwi Jan 21 '11 at 20:26
1

Try the following:

SELECT * 
FROM users u, tags t1, tags t2
WHERE t1.user_id = t2.user_id
AND t1.name = 'tag1'
AND t2.name = 'tag2'
AND t1.user_id = u.id

Obviously, for a large number of tags, the performance of this query will be severely degraded.

ryanprayogo
  • 11,587
  • 11
  • 51
  • 66
  • `t1.user_id = t2.user_id` and then ` t1.user_id = u.id` smells bad too. – rsenna Jan 21 '11 at 20:10
  • @rsenna: I'm curious as to what's bad about that? – ryanprayogo Jan 21 '11 at 20:11
  • it's ANSI-89 syntax not ANSI-92 - you should try to use the most recent standards... – Leslie Jan 21 '11 at 20:17
  • @ryanprayogo: It "smells" because it just confuses things - you should not join two dependent tables between themselves, if the real FK is between the main and the dependent table. So, use `t1.user_id = u.id` and `t2.user_id = u.id` - that makes clear what the real relationships are, and works the same way. – rsenna Jan 21 '11 at 20:20
  • 1
    +1 because there is nothing wrong with the old syntax (the new syntax is confusing to beginners). +1 because this is the best of the Answers so far. – PerformanceDBA Jan 21 '11 at 20:35
  • 2
    @PerformanceDBA: I find the new syntax confusing to old timers, not beginners - but your experience may vary. About the best answers - just look at @squillman's, it's the same (but uses the new syntax and it makes sense for a change). – rsenna Jan 21 '11 at 20:40
  • @rsenna: what part of the syntax are you referring to and how would you change it? or are you just referring to the lack of explicitly stating the join as HLGEM mentions? – Stephen P Jan 21 '11 at 21:07
  • @rsenna. They may be more occurring in the world that what you find. squillman and ryanprayogo are both correct. Read my answer for other details. – PerformanceDBA Jan 21 '11 at 23:16
1
select * from users u
where 2 = (select count(*) from tags t where t.user_id = u.id and name in ('tag1','tag2'))

Assuming that any given tag can only be present once per user.

Seva Alekseyev
  • 59,826
  • 25
  • 160
  • 281
1

Give this a try

SELECT *
FROM users
INNER JOIN tags ON tags.user_id = users.id
WHERE users.id in
    (
    SELECT user_id
    FROM tags
    WHERE name IN ('tag1', 'tag2')
    GROUP BY user_id
    HAVING COUNT(*) = 2
    )
bobs
  • 21,844
  • 12
  • 67
  • 78
0

You'll need to check for the existence of two rows, rather than being able to do a simple IN (which will only check the values within each joined record). Maybe something like:

SELECT * 
from users
WHERE EXISTS (SELECT NULL FROM tags WHERE tags.user_id = users.id AND tags.name = 'tag1')
  AND EXISTS (SELECT NULL FROM tags WHERE tags.user_id = users.id AND tags.name = 'tag2');
bdukes
  • 152,002
  • 23
  • 148
  • 175
  • @bdukes. The use of Null as an indicator is reverse to the logic/intent and it does not work on all platforms, always use "1" such cases. – PerformanceDBA Jan 21 '11 at 23:18
  • @PerformanceDBA: I would also use "1" in this case, BUT I still think you're just being a little bit stubborn here. 1) Why is it "reverse to the logic/intent"? It's an EXISTS, we couldn't care **less** about what's being returned by the inner select clause! Hence, the use of `NULL` is completely acceptable. – rsenna Jan 24 '11 at 15:44
  • @PerformanceDBA: (cont.) 2) Which platform does it not work with? [SQL Server](http://bradsruminations.blogspot.com/2009/09/age-old-select-vs-select-1-debate.html), [Oracle](http://searchoracle.techtarget.com/answer/How-does-WHERE-EXISTS-SELECT-NULL-work), MySQL and DB2 all accept it at least. And in those platforms, under current DBMS versions, `SELECT NULL` works exactly the same way as `SELECT 1` for `EXISTS` clauses - the only real problem is with `... EXISTS (SELECT * ...`, and even that "issue" is usually exaggerated. – rsenna Jan 24 '11 at 15:45
  • @rsenna. Keep the personal opinions and name-calling to yourself. It does **not** work on DB2, have you tried it ? What version ? Are you aware of the Standard definition of Null ? Existence is a known fact, Null is the Unknown, a non-fact. Do I have to explain "reverse logic" to you ? – PerformanceDBA Jan 25 '11 at 12:47
  • 1
    @PerformanceDBA: Uh, yes, in fact I did try it. I'll repeat: it doesn't really **matter** if it is `SELECT 1` or `SELECT NULL` for `EXISTS` clauses. Because NO select is being executed after all... If you need proof, see [here](http://publib.boulder.ibm.com/infocenter/dzichelp/v2r2/topic/com.ibm.db2.doc.sqlref/x1011f5.htm). You'll see that the recommended `EXISTS` usage in DB2 is `SELECT *` - that would take a few more nanoseconds to compile in SQL Server, but DB2 does not even have this issue, it just ignores the SELECT columns... – rsenna Jan 25 '11 at 13:19
0

What about

SELECT * FROM users, tags WHERE tags.user_id = users.user_id AND tags.name = 'tag1'
INTERSECT
SELECT * FROM users, tags WHERE tags.user_id = users.user_id AND tags.name = 'tag2'
TMN
  • 3,060
  • 21
  • 23
-2

Try WHERE tags.name IN ('tag1') and tags.name IN ('tag2');

Not super efficient, but probably one of many ways.

Josh
  • 10,352
  • 12
  • 58
  • 109
  • 1
    How would you expect anything to be in both those groups? `tags.name` is a string, not some kind of array. It cannot both be 'tag1' and 'tag2' :) – Jakob Jan 21 '11 at 19:58