2

There's two tables:

  1. users
  2. documents

users:

CREATE TABLE IF NOT EXISTS users (
    id         SERIAL PRIMARY KEY,
    username   TEXT NOT NULL UNIQUE
)

documents:

CREATE TABLE IF NOT EXISTS documents (
    id         SERIAL PRIMARY KEY,
    user_id    INT NOT NULL REFERENCES users,
    name       TEXT NOT NULL,
    value      INT NOT NULL 
)

I want to filter users by document name and value. Usually, between 2-5 document name and value is filtered. And there are apprx 6-10 documents per user.

I have a huge database and want to improve this query. I think I can get faster query without the HAVING clause. Any help highly appreciated. I use PostgreSQL 13.

The query I am using:

SELECT
    users.username,
    jsonb_agg(jsonb_strip_nulls(jsonb_build_object('name', documents.name, 'value', documents.value))) AS docs
FROM
    users
JOIN
    documents
ON
    users.id = documents.user_id
GROUP BY
    users.username
HAVING
    jsonb_agg(jsonb_build_object('name', documents.name, 'value', documents.value)) @? '$[*] ? (@.name == "doc1") ? (@.value == "2")'
Ulvi
  • 965
  • 12
  • 31

1 Answers1

4

For big tables it's extremely expensive to join and aggregate all rows before finally filtering the few qualifying rows.

It should be faster by orders of magnitude to filter qualifying documents first, then grab all documents for the same user(s), aggregate, and finally join to users:

SELECT u.username, d.docs
FROM  (
   SELECT user_id, jsonb_agg(jsonb_build_object('name', d.name, 'value', d.value)) AS docs
   FROM   documents d1
   JOIN   documents d USING (user_id)
   WHERE  d1.name = 'doc1'
   AND    d1.value = 2
   -- AND    d.name  IS NOT NULL  -- strip NULLs early 
   -- AND    d.value IS NOT NULL  -- if not defined NOT NULL anyway
   GROUP  BY 1
   ) d
JOIN   users u ON u.id = d.user_id;

While being at it, I removed jsonb_strip_nulls() since all processed columns are defined NOT NULL anyway. Also cheaper.

Possibly simplify to just jsonb_build_object(d.name, d.value).

For the first step, an index on documents(name, value) will help a lot. Maybe even on documents(name, value, user_id) to get index-only scans (depends).

It should be safe to assume there is also an index on documents(user_id). Helps the next step. Again, documents(user_id, name, value) for index-only scans.

And, finally, an index on users(id). Should be a given. Again, users(id, username) for index-only scans.

If (name, value) is not UNIQUE per user (like it seems to be the case), use EXISTS instead to avoid duplicates:

SELECT u.username, d.docs
FROM  (
   SELECT user_id, jsonb_agg(jsonb_build_object('name', d.name, 'value', d.value)) AS docs
   FROM   documents d
   WHERE  EXISTS (
      SELECT FROM documents d1
      WHERE  d1.user_id = d.user_id
      AND    d1.name = 'doc1'
      AND    d1.value = 2
      )
   GROUP  BY 1
   ) d
JOIN   users u ON u.id = d.user_id;

Similar query plan, same indexes can be used.

Related:

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • That's better! But I get duplicate objects when I don't make any filtering. I tried to solve this but didn't succeed. Do you know a way to do it? I will use it in backend of a website where filtering is dynamically added. – Ulvi Jan 17 '22 at 16:07
  • And also, it is interesting, why does it generate same objects multiple times when filters are not added? – Ulvi Jan 17 '22 at 16:20
  • 1
    @Ulvi: The first query generate duplicates if `(name, value)` is not unique per user. I added an alternative to avoid duplicates, and some more to adjust to your updated table definition. – Erwin Brandstetter Jan 17 '22 at 16:35
  • @ErwinBrandstetter Hi Erwin, allow me. I didn't get the second query, not sure why do self join. I think self join is unnecessary in this context.. – jian Dec 21 '22 at 10:55
  • @jian The semi-join with `EXISTS` in the 2nd query is necessary. – Erwin Brandstetter Dec 21 '22 at 17:44