Looking in array with a big set of input values

Question

I have a table that looks like (with an example of number of rows in each to get the kind of ration):

expectedreportsnodes (1 000 000 rows):

 nodejoinkey   | integer  | not null
 nodeid        | text     | not null
 nodeconfigids | text[]   |

nodeconfigids array generally contains 1-50 values.

And a second table:

expectedreports (10 000 rows):

 pkid       | integer  | not null
 nodejoinkey| integer  | not null
 ...

I want to query for all expected reports for which their exists a entry in nodeexpectedreports with a given nodeConfigId. I have potentially large amount of nodeConfigIds (thousands).

What is most efficient way of doing that?

For now, I have:

select E.pkid, E.nodejoinkey from expectedreports E 
inner join (
  select NN.nodejoinkey, NN.nodeid, NN.nodeconfigids from (
    select N.nodejoinkey, N.nodeid, unnest(N.nodeconfigids) as nodeconfigids  
    from expectedreportsnodes N
  ) as NN 
  where NN.nodeconfigids) IN( VALUES ('cf1'), ('cf2'), ..., ('cf1000'), ..., ('cfN')  )
  ) as NNN on E.nodejoinkey = NNN.nodejoinkey;

This seems to give the expected results but takes ages to execute.

What can be done to improve the query?

Updates:

the proposed answer with array overlap and indexes is vastly less efficient on my set-up. I'm not able to say why.
the following version seems to be the quickiest (again, not the least idea why - perhaps because I generally have few values in values in nodeconfigids?):

_

select E.pkid, E.nodejoinkey from expectedreports E
inner join (
  select NN.nodejoinkey, NN.nodeconfigids
  from (
    select N.nodejoinkey, N.nodeconfigids, 
           generate_subscripts(N.nodeconfigids,1) as v
    from expectedreportsnodes N
  ) as NN
  where NN.nodeconfigids[v] in(values ('cf1'), ('cf2'), ..., ('cf1000'), ..., ('cfN') )
) as NNN
on E.nodejoinkey = NNN.nodejoinkey

To optimize performance, we generally need more specific data. Consider the [tag info for `[postgresql-performance]`](http://stackoverflow.com/tags/postgresql-performance/info). — Erwin Brandstetter, Jan 20 '15 at 12:05
BTW, your `IN (VALUES ...)` expression is invalid syntax. You are mixing row and set syntax. — Erwin Brandstetter, Jan 20 '15 at 12:42
There is no need for a `values` keyword when using `IN` just use `in ('cf1', 'cf2', ...)` — , Jan 21 '15 at 13:51
@a_horse_with_no_name: I read that it was better with a big number of values: http://postgres.cz/wiki/PostgreSQL_SQL_Tricks_I#Predicate_IN_optimalization . Is it generally false ? (I will need to bench it, to) — fanf42, Jan 21 '15 at 13:55
Interesting, I have never seen that. Worth investigating though — , Jan 21 '15 at 14:03
@a_horse_with_no_name: `IN` doesn't scale well with long lists. I have same related notes in my answer. — Erwin Brandstetter, Jan 21 '15 at 14:16
@ErwinBrandstetter: I understand that. But does wrapping the list in a `values` clause really change that? — , Jan 21 '15 at 14:18
@a_horse_with_no_name: That variant is new for me, too. But the source is good. The author is [Pavel](http://stackoverflow.com/users/406691/pavel-stehule). — Erwin Brandstetter, Jan 21 '15 at 14:22
I took the liberty and removed one set of parentheses that would spoil the syntax. — Erwin Brandstetter, Jan 21 '15 at 14:31

score 3 · Accepted Answer · edited May 23 '17 at 12:32

The key to performance is a GIN index on the array column. And work with operators that can use the index.

CREATE INDEX ern_gin_idx ON expectedreportsnodes USING gin (nodeconfigids);

Query:

SELECT e.pkid, nodejoinkey 
FROM   expectedreports e
JOIN   expectedreportsnodes n USING (nodejoinkey)
WHERE  n.nodeconfigids && '{cf1, cf2, ..., cfN}'::text[];

This should work just fine for arrays of text, because the overlap operator && is supported by the default GIN operator class. Per documentation:

Name        Indexed Data Type  Indexable Operators
...
_text_ops   text[]             && <@ = @>
...

Also be sure to have a plain btree index on expectedreports.nodejoinkey:

CREATE INDEX expectedreports_nodejoinkey_idx ON expectedreports (nodejoinkey);

Optimize with multicolumn index

To optimize further for the given query, you could include the otherwise not useful column nodejoinkey to the index to allow index-only scans.

To include the integer column, first install the additional module btree_gin, which provides the necessary GIN operator classes. Run once per database:

CREATE EXTENSION btree_gin;

Then:

CREATE INDEX ern_multi_gin_idx ON expectedreportsnodes
USING gin (nodejoinkey, nodeconfigids);

Same query.
Related answers with more details:

Alternative with `unnest()`

If the GIN index isn't an option (or doesn't perform to your expectations) you can still optimize the query.

It's particularly efficient to unnest long input arrays (or use a VALUES expression like in your example) and then join to the derived table. An IN construct is typically the slowest option.

SELECT e.pkid, nodejoinkey
FROM  (
   SELECT DISTINCT n.nodejoinkey 
   FROM  (SELECT nodejoinkey, unnest(nodeconfigids) AS nodeconfigid
          FROM   expectedreportsnodes) n
   JOIN  (VALUES ('cf1'), ('cf2'), ..., ('cfN')) t(nodeconfigid) USING (nodeconfigid)
   ) n
JOIN   expectedreports e USING (nodejoinkey);

Modern form in Postgres 9.3+ with an implicit JOIN LATERAL:

SELECT e.pkid, nodejoinkey
FROM  (
   SELECT DISTINCT n.nodejoinkey 
   FROM  expectedreportsnodes n
       , unnest(n.nodeconfigids) nodeconfigid
   JOIN  unnest('{cf1, cf2, ..., cfN}'::text[]) t(nodeconfigid) USING (nodeconfigid)
   ) n
JOIN   expectedreports e USING (nodejoinkey);

Your original query can produce duplicate rows in the result. Folded with DISTINCT.
Details for JOIN LATERAL:
- Dynamically execute query using the output of another query

For short input arrays, an `ANY` construct is faster:

SELECT e.pkid, nodejoinkey
FROM  (
   SELECT DISTINCT e.nodejoinkey 
   FROM   expectedreportsnodes e
   JOIN   unnest(e.nodeconfigids) u(nodeconfigid) 
          ON u.nodeconfigid = ANY ('{cf1, cf2, ..., cfN}'::text[])
   ) n
JOIN   expectedreports e USING (nodejoinkey);

Aside: [Postgres 9.4 brought a couple of performance improvements for GIN indexes](http://www.postgresql.org/docs/9.4/static/release-9-4.html#AEN118991). The current version should be very attractive for you. — Erwin Brandstetter, Jan 19 '15 at 15:02
9.4 brought goodness in many ways... But I'm blocked and must support postgres user installation, from 8.4 — fanf42, Jan 20 '15 at 08:20
@fanf42: All of the above should work for pg 8.4+, and much faster than without GIN index. Just not quite as efficient as in the current version. I added some alternatives without GIN index, but the GIN index is still the silver bullet. — Erwin Brandstetter, Jan 20 '15 at 12:52
Thanks, that definitly helps. I'm going to set-up a new test bench and test them. In all case, the original question is considered answered, since I have all I need to know to test and optimize — fanf42, Jan 21 '15 at 13:52

score 1 · Answer 2 · answered Jan 19 '15 at 13:47

The following avoids the unnesting of the array and might be faster:

select E.pkid, E.nodejoinkey 
from expectedreports E 
  join expectedreportsnodes nn on E.nodejoinkey = NNN.nodejoinkey
where nn.nodeconfigids && array['cf1', 'cf2', ..., 'cf1000', ..., 'cfN'];

It will return rows from expectedreportsnodes where any of the values in the array appear in the nodeconfigids column.

Looking in array with a big set of input values

2 Answers2

Optimize with multicolumn index

Alternative with unnest()

For short input arrays, an ANY construct is faster:

Alternative with `unnest()`

For short input arrays, an `ANY` construct is faster: