3

In a product_tag table, the columns are

id, product_id, tag_id

If I would like to search for a product that is tag1 OR tag2 OR tag3, the direct way is:

SELECT DISTINCT productId FROM product_tags WHERE tagId IN (2,4);

If I would like to search for a product that is tag1 AND tag2 AND tag3, the direct way is:

SELECT productId FROM product_tag WHERE tag_id IN (tag1, tag2, tag3) GROUP BY productId HAVING COUNT(*) = 3

But the question is if I would like to search a product that has a complex tag relationship, such as:

product that is (tag1 OR tag2 OR tag3) AND (tag 4 OR tag5 OR tag 6) AND (tag 7 OR tag8 OR tag9)

What is the SQL expression with best performance? (and preferably elegant).

Edit:
The most important performance gain was to add indexes, as Remus in the comments recommended.

TiansHUo
  • 8,509
  • 7
  • 45
  • 57
  • 2
    The performance is not going to come from your SQL text, but from your indexes. – Remus Rusanu Apr 21 '11 at 16:38
  • How should I index this? – TiansHUo Apr 22 '11 at 09:21
  • 1
    @Remus, I used `CREATE INDEX "main"."product_category_fastindex" ON "product_tag" ("product_id" ASC, "tag_id" ASC)` and saw a >10x performance gain, thanks a lot!! – TiansHUo Apr 22 '11 at 09:32
  • 2
    You should also try adding a second index with the `tag_id` the leftmost column: `CREATE INDEX main.product_tag_fastindex ON products_tag (tag_id ASC, product_id ASC)`. This way the optimizer can choose between the two indexes and you may see even faster results. – Remus Rusanu Apr 22 '11 at 14:49
  • @Remus, this is good advice. For certain tag distributions (e.g. many types of different tags, small number of rows for each), going through a tags index and just getting the common rows might be faster. – Stephen Chung Apr 26 '11 at 11:12
  • @Remus, I added a second index, and the performance gain didn't make too much a difference since I benchmarked it with demo data(which wasn't large enough), maybe it would manifest when more data comes in. Thanks! – TiansHUo Apr 28 '11 at 11:50

5 Answers5

1

You really can't do this directly with a set-based language such as SQL.

Your simple "AND" version also won't work unless you have no duplicates of (productId,tagId).

For your complex relationship, it would be necessary to break your query apart into several subqueries. First break along all "AND" clauses:

WHERE tag_id IN (tag1, tag2, tag3)
WHERE tag_id IN (tag4, tag5, tag6)
WHERE tag_id IN (tag7, tag8, tag9)

Then do an INTERSECTION of the query results.

If any of these subqueries is not a simply OR'ed list but in turn contain AND's in a more complex logic structure, you need to further break down these subqueries recursively.

In other words, you can recursively break down the logic tree along "AND" clauses, then at each tree level do an INTERSECT of the query results.

Doing this is likely to be much faster than generating a huge SQL that will return the result in one go -- because each of the simple OR'ed list can leverage an index you have on tag_id.

Stephen Chung
  • 14,497
  • 1
  • 35
  • 48
  • I disagree with using INTERSECTION, that would be tremendously slow if I am not mistaken. – TiansHUo Apr 21 '11 at 15:17
  • 1
    Not slower than a full table scan when the engine is forced to evaluate a complex logic on each row. If each subquery only selects a very small subset of the rows, and those rows can be extracted quickly via an index, then the INTERSECT will be much faster. – Stephen Chung Apr 22 '11 at 04:15
1

Union all 3 groups. They're 3 selects, but they're really simple ones.

DanMan
  • 11,323
  • 4
  • 40
  • 61
Curtis
  • 675
  • 10
  • 28
0

I noticed Select values that meet different conditions on different rows?

How about

SELECT DISTINCT t1.productId FROM product_tags t1
JOIN product_tags t2 ON t1.productId=t2.productId AND t2.tagId IN (tag4,tag5,tag6)
JOIN product_tags t3 ON t1.productId=t3.productId AND t3.tagId IN (tag7, tag8, tag9)
AND t1.tagId IN (tag1,tag2,tag3)

It would be even better if DISTINCT could be removed somehow.

Community
  • 1
  • 1
TiansHUo
  • 8,509
  • 7
  • 45
  • 57
  • This can work if you are only considering one level of AND's. The query optimizer should essentially turn it into three subqueries and then further process, otherwise it will run much slower chasing the primary keys. – Stephen Chung Apr 22 '11 at 04:18
  • This is the version I finally used, and with indexes, it is fast. The SQL is also very easy to generate. – TiansHUo Apr 26 '11 at 10:33
  • Then you're running on a DB engine with a very good query optimizer -- one that takes special consideration for self-joins on the primary key and optimize away multiple-row accesses. Joins are quite slow on many engines, esp. embedded ones, that you always want to limit the number of joins and restrict the join space as much as possible. – Stephen Chung Apr 26 '11 at 11:09
0

The performance wouldn't be that great but you could do a nested query

SELECT 
ProductID FROM
Products 
WHERE tag_id IN (tag1, tag2, tag3)
AND ProductID IN (
SELECT 
ProductID FROM
Products 
WHERE tag_id IN (tag4, tag5, tag6)
)
AND ProductID IN (
SELECT 
ProductID FROM
Products 
WHERE tag_id IN (tag7, tag8, tag9)
)
Thomas Paine
  • 301
  • 3
  • 11
0

Is the number of tags known in advance? If it isn't something that will grow over time, I would change tag_id to a bitset.

WITH T AS 
 (SELECT product_id, bit_or((1<<tag_id)::bigint) tagset 
  FROM product_tag GROUP BY product_id) 
SELECT product_id 
WHERE (tagset & 7)>0 AND (tagset & 56)>0 AND (tagset & 448)>0;

I have used Postgres here, where & is known as bitwise AND; bit_or is an aggregate function (SUM would work just as well here, assuming no duplicates allowed in the product_tag table). The magic numbers in the masks are just bit_ors of powers of two. The double-colon is a Postgres cast. Everything here should be available under slightly different names elsewhere. But PG also has bitstrings of indefinite size, and the same logic with bitstrings can be implemented for a large number of tags.

By the way, the situation of matching all tags is just (tagset & mask)=mask.

This is in fact why your indices are working so fast; they are probably being merged into this type of test.

Andrew Lazarus
  • 18,205
  • 3
  • 35
  • 53
  • The number of tags are unknown, and supposedly many. Isn't this the same as making a column as boolean? – TiansHUo Apr 25 '11 at 03:36
  • Similar, basically a vector of bools. Thinking about this some more, I think I would use JOINs for the ANDs, anding a bunch of tables whose where clauses use ORs. The DB may be optimized for that sort of OR query along the lines of creating its own bitset. The trick is to replace AND in a WHERE with `SELECT X FROM Y AS Y1 JOIN Y AS Y2 ON Y1.product_id=Y2.product_id AND Y2.tag_id=something WHERE Y1.tag_id=something_else;` ... but also the Ys having disjunctive conditions. Still thinking; interesting problem. – Andrew Lazarus Apr 25 '11 at 03:47