
In table users, there is a column user_ids varchar[] with a GIN index. Each user_id in column user_ids of each row is unique.
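
For reference, the relevant part of the setup looks roughly like this (other columns omitted):

CREATE TABLE users (
  -- other columns omitted
  user_ids varchar[]
);

CREATE INDEX ON users USING gin (user_ids);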

For each user_id in an input list, I want to count the number of rows whose user_ids contains it.

I tried this query, but EXPLAIN shows it doesn't use the GIN index:

SELECT user_id, count(*) + 1 COUNT
FROM (
    SELECT unnest(user_ids) user_id
    FROM users 
    WHERE user_ids && ARRAY['1', '2', '3']::varchar[]
) sub
WHERE user_id = ANY(ARRAY['1', '2', '3']::varchar[])
GROUP BY user_id;

Is there any way to make use of GIN index here?

  • To help you with your [tag:query-optimization] question, we need to see your table definitions, your index definitions, and the output of EXPLAIN. Please [read this](https://stackoverflow.com/tags/query-optimization/info), then [edit] your question. – O. Jones May 11 '23 at 12:08

1 Answer


For small input arrays and limited overlap between rows, this query should be as fast as it gets:

SELECT i.user_id, sub.*
FROM   unnest('{1, 2, 3}'::varchar[]) i(user_id)  -- your input here
CROSS  JOIN LATERAL (
   SELECT count(*) AS count
   FROM   users u
   WHERE  u.user_ids @> ARRAY[i.user_id]  -- !
   ) sub;

Notable difference: my query includes elements of the input with 0 matches; your original does not.
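
If you want to mimic the original behavior and drop inputs with no matching rows, a simple filter on the lateral result should do (a sketch based on the query above):

SELECT i.user_id, sub.*
FROM   unnest('{1, 2, 3}'::varchar[]) i(user_id)
CROSS  JOIN LATERAL (
   SELECT count(*) AS count
   FROM   users u
   WHERE  u.user_ids @> ARRAY[i.user_id]
   ) sub
WHERE  sub.count > 0;  -- exclude inputs with 0 matches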

The array operator @> expects array types as left and right operands (even if the right operand is just a single element, like in this case).

Array operators can use a GIN index on (user_ids). The subquery in your original query also qualifies for index usage!
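
For illustration, both predicates below find rows whose user_ids contains '2', but only the array-operator form can use the GIN index; the scalar ANY construct cannot:

WHERE user_ids @> ARRAY['2']::varchar[]  -- array operator, can use the GIN index
WHERE '2' = ANY(user_ids)                -- scalar construct, cannot use the GIN index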

See:

Why does my query plan not use the GIN index?

Maybe Postgres chooses a different query plan because the filter is not selective enough, or the column statistics are outdated?
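
If outdated statistics are the suspect, refreshing them is cheap:

ANALYZE users;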

To determine whether the index is applicable at all, run a test with this local setting (only in your session!):

SET enable_seqscan = off;

EXPLAIN ....

This forces a plan without sequential scan if at all possible.
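
A minimal sketch of such a test session, reusing the query from above and resetting the flag afterwards:

SET enable_seqscan = off;

EXPLAIN (ANALYZE, BUFFERS)
SELECT i.user_id, sub.*
FROM   unnest('{1, 2, 3}'::varchar[]) i(user_id)
CROSS  JOIN LATERAL (
   SELECT count(*) AS count
   FROM   users u
   WHERE  u.user_ids @> ARRAY[i.user_id]
   ) sub;

RESET enable_seqscan;  -- restore the session default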

  • If that does not clear things up, provide the result of `EXPLAIN (ANALYZE, BUFFERS, SETTINGS) SELECT ...`, an exact `CREATE TABLE` statement, cardinalities, your Postgres version, some sample rows, and sample input in the question. – Erwin Brandstetter May 11 '23 at 12:54
  • We had a similar situation a while back and using the `ANY` construction turned out to be much faster than the `@>` one... YMMV, things are never B&W in SQL =) – deroby May 11 '23 at 19:27
  • @deroby The `ANY` construct is often faster when not involving an index. For the case at hand, if the table isn't trivially small and the input is selective, a GIN index typically beats the rest (by far). – Erwin Brandstetter May 11 '23 at 21:47
  • Postgres uses the index with your query, or with `SET enable_seqscan = off;`, with some improvement in performance. But exactly as you predicted, when the input array is larger (around 100 elements in my case), a seq scan brings better results. Your answer is very helpful and enlightening. Thanks – eee May 11 '23 at 23:36