0

Can anyone tell me ways to do this kind of search in a database?

I got these tables:

posts (id, tags_cache)
tags (id, name)
posts_tags (post_id, tag_id)

The user enters a search query (say "water blue") and I want to show the posts that have both tags. The only way I can think of to search is using FIND_IN_SET, this way:

SELECT p.*, GROUP_CONCAT(t.name) AS tags_search
FROM posts p
LEFT JOIN posts_tags pt ON p.id = pt.post_id
LEFT JOIN tags t ON pt.tag_id = t.id
GROUP BY p.id
HAVING FIND_IN_SET('water', tags_search) > 0
AND FIND_IN_SET('blue', tags_search) > 0

The posts.tags_cache text column stores the names and id of the tags it belongs to (this way: water:15 blue:20).

To avoid JOINs by using this column for search, I've tried LIKE and INSTR but these will give inexact results since you can search for "ter" and you'll gets posts tagged 'water' and 'termal' for example. I've also tried REGEXP which gives exact results, but it's a slow process.

I can't use MATCH as tables use InnoDB.

So... is or are there other ways to accomplish this?

[Edit]

I forgot to mention that the user could search for many tags (not just 2), and even exclude tags: search posts tagged 'water' but not 'blue'. With FIND_IN_SET this works for me:

HAVING FIND_IN_SET('water', tags_search) > 0
AND NOT FIND_IN_SET('blue', tags_search) > 0

[Edit2]

I did some performance test (i.e. only checked how long the queries took, cached) as ypercube suggested, and these are the results:

muists | Bill K | ypercu | includes:excludes
--------------------------
0.0137 | 0.0009 | 0.0029 | 2:0
0.0096 | 0.0081 | 0.0033 | 2:1
0.0111 | 0.0174 | 0.0033 | 2:2
0.0281 | 0.0081 | 0.0025 | 5:1
0.0014 | 0.0013 | 0.0015 | 0:2

I don't know if this info is valid resource... But it shows that ypercube's method with a JOIN per tag is the quickest.

Parziphal
  • 6,222
  • 4
  • 34
  • 36
  • 3
    "To avoid JOINs" ... Why are you trying to avoid joins? – Mark Byers Nov 12 '11 at 18:25
  • Well, I read somewhere that using many JOINs isn't too good... – Parziphal Nov 12 '11 at 19:17
  • 3
    Do you remember where you read that? Try to avoid reading there in future. – Mark Byers Nov 12 '11 at 19:28
  • 2
    In SQL, a JOIN is just a logical programming construct. Saying JOINs are not good is like saying a `while` loop in not good. Obviously both can be misused, but they can also be the best solution for a given task. – Bill Karwin Nov 12 '11 at 20:57
  • I read that over 7 JOINs is lethal for a query (I once made a query with 14 JOINs). But if you sirs say that it's OK, then alright! – Parziphal Nov 12 '11 at 21:03

3 Answers3

4

I don't understand why you don't want to use JOINs nor why you're trying to use LEFT JOINs. You're looking for things that are there (rather than might be there) so get rid of the LEFT JOINs and just JOIN. And get rid of the tags_cache column, you're only asking for trouble with that sort of thing.

Something like this is what you're looking for:

select p.id
from posts p
join posts_tags pt on p.id = pt.post_id
join tags t on pt.tag_id = t.id
where t.name in ('water', 'blue')
group by p.id
having count(t.id) = 2

The 2 in the HAVING clause is the number of tags you're looking for.

And if you want to exclude certain tags, you could just add that to the WHERE clause like this:

select p.id
from posts p
join posts_tags pt on p.id = pt.post_id
join tags t on pt.tag_id = t.id
where t.name in ('water', 'blue')
  and p.id not in (
    select pt.post_id
    from posts_tags pt
    join tags t on pt.tag_id = t.id
    where t.name in ('pancakes', 'eggs') -- Exclude these
)
group by p.id
having count(t.id) = 2
mu is too short
  • 426,620
  • 70
  • 833
  • 800
3

Finding posts that match all of several conditions on different rows is a common problem.

Here are two ways to do it:

SELECT p.*
FROM posts p
INNER JOIN posts_tags pt ON p.id = pt.post_id
INNER JOIN tags t ON pt.tag_id = t.id
WHERE t.name IN ('water', 'blue')
GROUP BY p.id
HAVING COUNT(DISTINCT t.name) = 2;

Or:

SELECT p.*
FROM posts p
INNER JOIN posts_tags pt1 ON p.id = pt1.post_id
INNER JOIN tags t1 ON pt1.tag_id = t1.id
INNER JOIN posts_tags pt2 ON p.id = pt2.post_id
INNER JOIN tags t2 ON pt2.tag_id = t2.id
WHERE (t1.name, t2.name) = ('water', 'blue');

Re comment and edit:

The problem with the HAVING solution is that it must perform a table-scan, searching every row in the tables. This is often much slower than a JOIN (when you have appropriate indexes).

To support tag exclusion conditions, here's how I'd write it:

SELECT p.*
FROM posts p
INNER JOIN posts_tags pt1 ON p.id = pt1.post_id
INNER JOIN tags t1 ON pt1.tag_id = t1.id AND t1.name = 'water'
LEFT OUTER JOIN (posts_tags pt2 
INNER JOIN tags t2 ON pt2.tag_id = t2.id AND t2.name = 'blue')
  ON p.id = pt2.post_id
WHERE t2.id IS NULL;

Avoiding using JOINs because you read it somewhere that they are bad is senseless. You must understand that a JOIN is a basic operation in relational databases, and you should use it where the job calls for it.

Bill Karwin
  • 538,548
  • 86
  • 673
  • 828
  • The first option is simmilar to @mu-is-too-short's, but what if excluding a tag? – Parziphal Nov 12 '11 at 19:24
  • 2
    @renocor: See also this excellent answer on a variety of ways to accomplish this - plus performance tests (it's for PostgreSQL but MySQL would have similar results) that show that `JOIN` is probably faster: http://stackoverflow.com/questions/7364969/how-to-filter-sql-results-in-a-has-many-through-relation – ypercubeᵀᴹ Nov 12 '11 at 19:24
1

For your additional request, excluding some tags, you could use the next approach. It will give you all posts that have both water and blue tags but neither black, white or red:

SELECT p.*
FROM posts p
  INNER JOIN posts_tags pt1 ON p.id = pt1.post_id 
  INNER JOIN tags t1 ON pt1.tag_id = t1.id
  INNER JOIN posts_tags pt2 ON p.id = pt2.post_id
  INNER JOIN tags t2 ON pt2.tag_id = t2.id
WHERE (t1.name, t2.name) = ('water', 'blue')          --- include
  AND NOT EXISTS
      ( SELECT *
        FROM posts_tags pt
          INNER JOIN tags t ON pt.tag_id = t.id
        WHERE p.id = pt.post_id 
          AND t.name IN ('black', 'white', 'red')     --- exclude
      )
ypercubeᵀᴹ
  • 113,259
  • 19
  • 174
  • 235