1

I have a table that looks like (with an example of number of rows in each to get the kind of ration):

expectedreportsnodes (1 000 000 rows):

 nodejoinkey   | integer  | not null
 nodeid        | text     | not null
 nodeconfigids | text[]   | 

nodeconfigids array generally contains 1-50 values.

And a second table:

expectedreports (10 000 rows):

 pkid       | integer  | not null
 nodejoinkey| integer  | not null
 ...

I want to query for all expected reports for which their exists a entry in nodeexpectedreports with a given nodeConfigId. I have potentially large amount of nodeConfigIds (thousands).

What is most efficient way of doing that?

For now, I have:

select E.pkid, E.nodejoinkey from expectedreports E 
inner join (
  select NN.nodejoinkey, NN.nodeid, NN.nodeconfigids from (
    select N.nodejoinkey, N.nodeid, unnest(N.nodeconfigids) as nodeconfigids  
    from expectedreportsnodes N
  ) as NN 
  where NN.nodeconfigids) IN( VALUES ('cf1'), ('cf2'), ..., ('cf1000'), ..., ('cfN')  )
  ) as NNN on E.nodejoinkey = NNN.nodejoinkey;

This seems to give the expected results but takes ages to execute.

What can be done to improve the query?

Updates:

  • the proposed answer with array overlap and indexes is vastly less efficient on my set-up. I'm not able to say why.
  • the following version seems to be the quickiest (again, not the least idea why - perhaps because I generally have few values in values in nodeconfigids?):

_

select E.pkid, E.nodejoinkey from expectedreports E
inner join (
  select NN.nodejoinkey, NN.nodeconfigids
  from (
    select N.nodejoinkey, N.nodeconfigids, 
           generate_subscripts(N.nodeconfigids,1) as v
    from expectedreportsnodes N
  ) as NN
  where NN.nodeconfigids[v] in(values ('cf1'), ('cf2'), ..., ('cf1000'), ..., ('cfN') )
) as NNN
on E.nodejoinkey = NNN.nodejoinkey
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
fanf42
  • 1,828
  • 15
  • 22
  • To optimize performance, we generally need more specific data. Consider the [tag info for `[postgresql-performance]`](http://stackoverflow.com/tags/postgresql-performance/info). – Erwin Brandstetter Jan 20 '15 at 12:05
  • BTW, your `IN (VALUES ...)` expression is invalid syntax. You are mixing row and set syntax. – Erwin Brandstetter Jan 20 '15 at 12:42
  • There is no need for a `values` keyword when using `IN` just use `in ('cf1', 'cf2', ...)` –  Jan 21 '15 at 13:51
  • 1
    @a_horse_with_no_name: I read that it was better with a big number of values: http://postgres.cz/wiki/PostgreSQL_SQL_Tricks_I#Predicate_IN_optimalization . Is it generally false ? (I will need to bench it, to) – fanf42 Jan 21 '15 at 13:55
  • Interesting, I have never seen that. Worth investigating though –  Jan 21 '15 at 14:03
  • @a_horse_with_no_name: `IN` doesn't scale well with long lists. I have same related notes in my answer. – Erwin Brandstetter Jan 21 '15 at 14:16
  • @ErwinBrandstetter: I understand that. But does wrapping the list in a `values` clause really change that? –  Jan 21 '15 at 14:18
  • @a_horse_with_no_name: That variant is new for me, too. But the source is good. The author is [Pavel](http://stackoverflow.com/users/406691/pavel-stehule). – Erwin Brandstetter Jan 21 '15 at 14:22
  • I took the liberty and removed one set of parentheses that would spoil the syntax. – Erwin Brandstetter Jan 21 '15 at 14:31

2 Answers2

3

The key to performance is a GIN index on the array column. And work with operators that can use the index.

CREATE INDEX ern_gin_idx ON expectedreportsnodes USING gin (nodeconfigids);

Query:

SELECT e.pkid, nodejoinkey 
FROM   expectedreports e
JOIN   expectedreportsnodes n USING (nodejoinkey)
WHERE  n.nodeconfigids && '{cf1, cf2, ..., cfN}'::text[];

This should work just fine for arrays of text, because the overlap operator && is supported by the default GIN operator class. Per documentation:

Name        Indexed Data Type  Indexable Operators
...
_text_ops   text[]             && <@ = @>
...

Also be sure to have a plain btree index on expectedreports.nodejoinkey:

CREATE INDEX expectedreports_nodejoinkey_idx ON expectedreports (nodejoinkey);

Optimize with multicolumn index

To optimize further for the given query, you could include the otherwise not useful column nodejoinkey to the index to allow index-only scans.

To include the integer column, first install the additional module btree_gin, which provides the necessary GIN operator classes. Run once per database:

CREATE EXTENSION btree_gin;

Then:

CREATE INDEX ern_multi_gin_idx ON expectedreportsnodes
USING gin (nodejoinkey, nodeconfigids);

Same query.
Related answers with more details:

Alternative with unnest()

If the GIN index isn't an option (or doesn't perform to your expectations) you can still optimize the query.

It's particularly efficient to unnest long input arrays (or use a VALUES expression like in your example) and then join to the derived table. An IN construct is typically the slowest option.

SELECT e.pkid, nodejoinkey
FROM  (
   SELECT DISTINCT n.nodejoinkey 
   FROM  (SELECT nodejoinkey, unnest(nodeconfigids) AS nodeconfigid
          FROM   expectedreportsnodes) n
   JOIN  (VALUES ('cf1'), ('cf2'), ..., ('cfN')) t(nodeconfigid) USING (nodeconfigid)
   ) n
JOIN   expectedreports e USING (nodejoinkey);

Modern form in Postgres 9.3+ with an implicit JOIN LATERAL:

SELECT e.pkid, nodejoinkey
FROM  (
   SELECT DISTINCT n.nodejoinkey 
   FROM  expectedreportsnodes n
       , unnest(n.nodeconfigids) nodeconfigid
   JOIN  unnest('{cf1, cf2, ..., cfN}'::text[]) t(nodeconfigid) USING (nodeconfigid)
   ) n
JOIN   expectedreports e USING (nodejoinkey);

For short input arrays, an ANY construct is faster:

SELECT e.pkid, nodejoinkey
FROM  (
   SELECT DISTINCT e.nodejoinkey 
   FROM   expectedreportsnodes e
   JOIN   unnest(e.nodeconfigids) u(nodeconfigid) 
          ON u.nodeconfigid = ANY ('{cf1, cf2, ..., cfN}'::text[])
   ) n
JOIN   expectedreports e USING (nodejoinkey);
Community
  • 1
  • 1
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • Aside: [Postgres 9.4 brought a couple of performance improvements for GIN indexes](http://www.postgresql.org/docs/9.4/static/release-9-4.html#AEN118991). The current version should be very attractive for you. – Erwin Brandstetter Jan 19 '15 at 15:02
  • 9.4 brought goodness in many ways... But I'm blocked and must support postgres user installation, from 8.4 – fanf42 Jan 20 '15 at 08:20
  • @fanf42: All of the above should work for pg 8.4+, and much faster than without GIN index. Just not quite as efficient as in the current version. I added some alternatives without GIN index, but the GIN index is still the silver bullet. – Erwin Brandstetter Jan 20 '15 at 12:52
  • Thanks, that definitly helps. I'm going to set-up a new test bench and test them. In all case, the original question is considered answered, since I have all I need to know to test and optimize – fanf42 Jan 21 '15 at 13:52
1

The following avoids the unnesting of the array and might be faster:

select E.pkid, E.nodejoinkey 
from expectedreports E 
  join expectedreportsnodes nn on E.nodejoinkey = NNN.nodejoinkey
where nn.nodeconfigids && array['cf1', 'cf2', ..., 'cf1000', ..., 'cfN'];

It will return rows from expectedreportsnodes where any of the values in the array appear in the nodeconfigids column.