2

I'm not a database programmer, but I have a simple database-backed app where I have items with tags. Each item may have multiple tags, so I'm using a typical junction table (like this), where each row represents the fact that the item with the appropriate ID has the tag with the appropriate ID.

This works very logically when I want to do something like select all items with a given tag.

But, what is the typical pattern for doing AND searches? That is, what if I want to find all items which have all of a certain set of tags? This is such a common operation that I'd think some of the intro tutorials would cover it, but I guess I'm not looking in the right places.

The approach I tried was to use INTERSECT, first directly and then with subqueries and IN. This works, but builds up long-seeming queries quickly as I add search terms. And, crucially, this approach appears to be about an order of magnitude slower than the approach of shoving all the tags as text into one "tags" column and using SQLite's full-text search. (And, as I would expect/hope, the FTS search gets faster as I add more terms, which doesn't seem to be the case with the INTERSECTS approach.)

What's the proper design pattern here, and what's the right way to make it snappy? I'm using SQLite in this case, but I'm most interested in a general answer, since this must be a common thing to do.

Community
  • 1
  • 1
mattdm
  • 2,082
  • 25
  • 39

2 Answers2

2

The following is the standard ANSI SQL solution which avoids synchronizing the number of ids and the ids themselves.

with tag_ids (tid) as (
   values (1), (2)
)
select id
from tags
where id (select tid from tag_ids)
having count(*) = (select count(*) from tag_ids);

The values clause ("row constructor") is supported by PostgreSQL and DB2. For database that don't support that, you can replace it with a simple "select", e.g. in Oracle this would be:

with tag_ids (tid) as (
   select 1 as tid from dual
   union all 
   select 2 from dual
)
select id
from tags
where id (select tid from tag_ids)
having count(*) = (select count(*) from tag_ids);

For SQL Server you would simply leave out the "from dual", as it does not require a FROM clause for a SELECT.

This assumes that one tag can only be assigned exactly once. If that isn't the case, you would need to use a count(distinct id) in the having clause.

1

I would be inclined to use a group by:

select id
from tags
where id in (<tag1>, <tag2>)
group by id
having count(*) = 2

This would guarantee that both appear.

For an unlimited size list, you could store the ids in a string, such as '|tag1|tag2|tag3|' (note delimiters on ends). Then you can do:

select id
from tags
where @taglist like '%|'+tag+'|%'
group by id
having count(*) = len(@taglist) - (len(replace(@taglist, '|', '') - 1)

This is using SQL Server syntax. But, it is saying two things. The WHERE clause is saying that the tag is in the list. The HAVING clause is saying that the number of matches equals the length of the list. It does this with a trick, by counting the number of separtors and subtracting 1.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • A character column does not really qualify as an "unlimited size list". Storing delimited values isn't such a good idea. –  Aug 20 '12 at 21:34
  • @a_horse_with_no_name . . . I'm not quite sure how to respond. In SQL Server, a character string can have 1 billion+ characters. Not quite unlimited, but big enough. Although storing delimited values is not usually a good idea, this is about passing a list into a query statement. Putting the values in a character string may be a good way of doing that from an application. – Gordon Linoff Aug 20 '12 at 21:41
  • If taglist is a variable then this makes much more sense. I thought you wanted to store all assigned tag ids in one big character column (which would not be a a good idea) –  Aug 20 '12 at 21:45
  • @a_horse_with_no_name . . . I suspected a misunderstanding like this, and changed the code to have a variable. Originally, I used a more generic notation, only because the question did not (and does not) mention a particular database. – Gordon Linoff Aug 20 '12 at 21:52
  • The database happens to be SQLite (which I did mention in passing) but I'm looking for a general answer. – mattdm Aug 21 '12 at 04:15