3

I have a table that references a bunch of articles, the table contains tags for those articles. Like this:

tag text
article_id bigint

I want to select all article_ids with a set of tags, say tag1, tag2, tag3 but the article could also have tag4, tag5 as well attached to it.

I know this will work:

SELECT article_id 
FROM tag WHERE tag='tag1' 
INTERSECT 
SELECT article_id 
FROM tag 
WHERE tag='tag2' 
INTERSECT 
SELECT article_id 
FROM tag 
WHERE tag='tag3'

And so will this:

SELECT article_id 
FROM tag 
WHERE tag IN ('tag1','tag2','tag3') 
GROUP BY article_id 
HAVING count(*) = 3

But I am not sure that that is the most efficient way to do this. I have also been playing with the below, but cannot get it to work currently.

SELECT array_agg(tag) as arr,
       article_id 
FROM tag 
GROUP BY article_id 
HAVING arr = {tag1,tag2,tag3}

This just seemed like a common issue that others would encounter, I was wondering if the INTERSECT is the most efficient query in this case. It is for PostgreSQL.

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
ehiller
  • 1,346
  • 17
  • 32
  • 1
    If you want to find the most efficient query, just run an `EXPLAIN ANALZYE` –  Mar 11 '12 at 23:20

3 Answers3

1

I assume you want more columns from article than just the article_id. Doesn't change the query style much, though.

What you are dealing with is called relational division. And there is quite a number of ways to skin this cat as can be seen under this related question - including advise on indexing and performance tests.

My personal favorites (and very likely among the fastest) are:

SELECT a.*
FROM   article a
JOIN   tag x USING (article_id)
JOIN   tag y USING (article_id)
JOIN   tag z USING (article_id)
WHERE  x.tag = 'tag1'
AND    y.tag = 'tag1'
AND    z.tag = 'tag3';

Or:

SELECT a.*
FROM   article a
WHERE  EXISTS (
   SELECT *
   FROM   tag x
   JOIN   tag y USING (article_id)
   JOIN   tag z USING (article_id)
   WHERE  x.article_id = a.article_id
   AND    x.tag = 'tag1'
   AND    y.tag = 'tag2'
   AND    z.tag = 'tag3'
   );

@David already explained why your third version cannot function. But you should never use it anyway because it cannot use indexes and will be slower by an order of magnitude than any of the other methods here.

Community
  • 1
  • 1
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
0

Personally I like the second option. But you should use PostgreSQL's tools, along with some test queries, to see which is the most efficient.

The reason the third doesn't work as written is because you need specify a sort order for array_agg(): how to make array_agg() work like group_concat() from mySQL

Community
  • 1
  • 1
David Faber
  • 12,277
  • 2
  • 29
  • 40
  • When rewrote it somewhat works as " SELECT article_id FROM tag GROUP BY article_id HAVING array_agg(tag::text ORDER BY tag)=ARRAY['tag1','tag2','tag3'];" but if the article has a tag4 then it won't find it. Is there a way I am missing to essentially make sure that all the items are in the array, but that that the arrays are not perfectly matching? – ehiller Mar 12 '12 at 01:42
0

In your last variant you are not interested in array equality. You are interested in array containment, HAVING arr contain all the three tags.

And array1 contains array2 is array1 @> array2 in PostgreSQL.

Still I expect Erwin's answer to win performance-wise thanks to indexing.

Michał Politowski
  • 4,288
  • 3
  • 30
  • 41