
I have a set of Items, and each of them has a set of Tags. I want a SELECT which, for some (large) subset of these items, returns the total count of each Tag within that subset.

Is there a way to do this using the PostgreSQL 9.3 / 9.4 array operators?

My plan B is to have a separate table Tags and a many-to-many link table Item_Tags, and then run something like this:

CREATE TABLE "Tags" (
  "Name" character varying,
  id uuid NOT NULL,
  CONSTRAINT id PRIMARY KEY (id)
);

CREATE TABLE "Items" (
  id uuid NOT NULL,
  data character varying,
  CONSTRAINT "Items_pkey" PRIMARY KEY (id)
);

CREATE TABLE "Item_Tags" (
  tag_id uuid,
  item_id uuid,
  id uuid NOT NULL,
  CONSTRAINT "Item_Tags_pkey" PRIMARY KEY (id),
  CONSTRAINT "Item_Tags_item_id_fkey" FOREIGN KEY (item_id)
      REFERENCES "Items" (id) MATCH SIMPLE
  ON UPDATE NO ACTION ON DELETE NO ACTION,
  CONSTRAINT "Item_Tags_tag_id_fkey" FOREIGN KEY (tag_id)
  REFERENCES "Tags" (id) MATCH SIMPLE
  ON UPDATE NO ACTION ON DELETE NO ACTION
);

Select "Tags"."Name", count(*)
From "Tags"
Join "Item_Tags" on "Tags"."id" = "Item_Tags"."tag_id"
Join "Items" on "Items"."id" = "Item_Tags"."item_id"
Where "Items"."data" in ('a', 'b', 'c', 'd', 'e') -- replace with actual criteria
Group By "Tags"."Name"

Is there a better way?

Are there any special indices to use which would help make this more efficient, assuming both the Items and Tags tables are large (hundreds of millions of items and millions of tags, respectively)?

If I want counts of all tags (without filtering), should I just create a view and use that?

1 Answer

DB schema

Your plan B is the generally superior approach (exceptions apply), but your implementation doesn't look good. Don't use non-descriptive identifiers like "id" or "name", don't use double-quoted mixed-case identifiers, etc. Consider this related answer with a "best-practice" code example.

Also, don't use UUID columns if you don't have to. The data type bigint (use a bigserial for your pk columns!) easily covers "hundreds of millions" of rows and is much faster and smaller on disk.
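
For orientation, here is a minimal sketch of what such a cleaned-up schema could look like. The table and column names match the queries below; the exact constraints (NOT NULL, the composite primary key on item_tag) are illustrative assumptions, not part of this answer:

CREATE TABLE tag (
   tag_id  bigserial PRIMARY KEY
 , tag     text NOT NULL UNIQUE           -- descriptive column name instead of "Name"
);

CREATE TABLE item (
   item_id bigserial PRIMARY KEY
 , data    text
);

CREATE TABLE item_tag (
   item_id bigint NOT NULL REFERENCES item
 , tag_id  bigint NOT NULL REFERENCES tag
 , PRIMARY KEY (item_id, tag_id)          -- no surrogate id needed for a plain link table
);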

Query

With a clean implementation, a (substantially faster) query could look like this:

For a small percentage of rows:

SELECT tag_id, t.tag, c.ct
FROM  (
   SELECT it.tag_id, count(*) AS ct
   FROM   item     i
   JOIN   item_tag it USING (item_id)
   WHERE  i.data = ANY ('{a,b,c,d,e}')    -- ANY is shorter for a long list
   GROUP  BY 1
   ) c
JOIN   tag t USING (tag_id);
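
As a hedged aside (not spelled out above): if the filter on item.data is selective, indexes along these lines could support this query. They assume the schema sketched earlier; adjust to your actual criteria:

-- illustrative supporting indexes
CREATE INDEX item_data_idx ON item (data);                 -- for the filter on item.data
CREATE INDEX item_tag_item_id_idx ON item_tag (item_id);   -- for the join; redundant if (item_id, tag_id) is already the PK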

For all tags (don't join to table item at all):

SELECT tag_id, t.tag, c.ct
FROM  (
   SELECT tag_id, count(*) AS ct
   FROM   item_tag
   GROUP  BY 1
   ) c
JOIN   tag t USING (tag_id);

Especially for lots of rows, it pays to aggregate before you join.
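
For contrast, a plain join-then-aggregate form (essentially the query from the question, transposed to this schema) joins to tag once per matching item_tag row and groups on the wider tag name, which is the extra work the subquery above avoids. A sketch for comparison, not a recommendation:

-- join to tag before aggregating: tag is looked up for every matching item_tag row
SELECT t.tag_id, t.tag, count(*) AS ct
FROM   item     i
JOIN   item_tag it USING (item_id)
JOIN   tag      t  USING (tag_id)
WHERE  i.data = ANY ('{a,b,c,d,e}')
GROUP  BY t.tag_id, t.tag;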

Answers to appended questions

  • Special indices may be possible for special cases, which you would have to define clearly. In particular, partial indices might come in handy (see the sketch after this list).

  • A view is not going to help performance at all. It's just a "saved query".
    A materialized view could be very useful for read-only tables or if you do not need to include the latest changes in every result. Introduced with Postgres 9.3. Expect further improvements from the upcoming Postgres 9.4.

  • If you want counts for all tags, indexes are not going to help performance at all, since Postgres will run sequential scans when most or all of the table is involved.
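
To make the partial-index and materialized-view points concrete, here is a hedged sketch of both. The WHERE condition of the partial index and the refresh policy are placeholder assumptions you would replace with your actual criteria:

-- partial index: only useful if your queries filter on the same condition
CREATE INDEX item_active_idx ON item (item_id) WHERE data = 'a';

-- materialized view for the "counts of all tags" case; refresh whenever staleness is acceptable
CREATE MATERIALIZED VIEW tag_count AS
SELECT tag_id, count(*) AS ct
FROM   item_tag
GROUP  BY 1;

REFRESH MATERIALIZED VIEW tag_count;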
