5

I cannot seem to find the trick to join two tables through an array column when one table is not an array value, and the other table's array value can contain multiple values. It does work when there is a single valued array.

Here's a simple minimal example of what I'm talking about. The real tables have GIN indexes on the array columns FWIW. These do not, but the query behaves the same.

DROP TABLE IF EXISTS eg_person;
CREATE TABLE eg_person (id INT PRIMARY KEY, name TEXT);
INSERT INTO eg_person (id, name) VALUES
  (1, 'alice')
, (2, 'bob')
, (3, 'charlie');

DROP TABLE IF EXISTS eg_assoc;
CREATE TABLE eg_assoc (aid INT PRIMARY KEY, actors INT[], benefactors INT[]);
INSERT INTO eg_assoc (aid, actors, benefactors) VALUES
  (1, '{1}'  , '{2}')
, (2, '{1,2}', '{3}')
, (3, '{1}'  , '{2,3}')
, (4, '{4}'  , '{1}');

SELECT aid, actors, a_person.name, benefactors, b_person.name 
FROM   eg_assoc
LEFT   JOIN eg_person a_person on array[a_person.id] @> eg_assoc.actors
LEFT   JOIN eg_person b_person on array[b_person.id] @> eg_assoc.benefactors;

The actual results are this like so. The issue here is that name column comes up NULL if either actors or benefactors contains more than one value.

 aid | actors | name  | benefactors |  name   
-----+--------+-------+-------------+---------
   1 | {1}    | alice | {2}         | bob
   2 | {1,2}  |       | {3}         | charlie
   3 | {1}    | alice | {2,3}       | 
   4 | {4}    |       | {1}         | alice

I was expecting this:

 aid | actors | name  | benefactors |  name   
-----+--------+-------+-------------+---------
   1 | {1}    | alice | {2}         | bob
   2 | {1,2}  | alice | {3}         | charlie
   2 | {1,2}  | bob   | {3}         | charlie
   3 | {1}    | alice | {2,3}       | bob
   3 | {1}    | alice | {2,3}       | charlie
   4 | {4}    |       | {1}         | alice

It would be really nice if I could get it to look like this, though:

 aid | actors | name        | benefactors |  name   
-----+--------+-------------+-------------+---------
   1 | {1}    | {alice}     | {2}         | {bob}
   2 | {1,2}  | {alice,bob} | {3}         | {charlie}
   3 | {1}    | {alice}     | {2,3}       | {bob, charlie}
   4 | {4}    |             | {1}         | {alice}

I'm aware that this schema denormalized, and I'm willing to go to a normal representation if need be. However, this is for a summary query and it already involves a lot more joins than I'd like.

BillRobertson42
  • 12,602
  • 4
  • 40
  • 57

2 Answers2

6

Yes, the overlap operator && can use a GIN index on arrays. Very useful for queries like this one to find rows with a given person (1) among an array of actors:

SELECT * FROM eg_assoc WHERE actors && '{1}'::int[]

However, the logic of your query is the other way round, looking for all persons listed in the arrays in eg_assoc. A GIN index is no help here. We just need the btree index of the PK person.id.

Proper queries

Basics:

The following queries preserve original arrays exactly as given, including possible duplicate elements and original order of elements. Works for 1-dimenstional arrays. Additional dimensions are folded into a single dimension. It's more complex to preserve multiple dimensions (but totally possible):

WITH ORDINALITY in Postgres 9.4 or later

SELECT aid, actors
     , ARRAY(SELECT name
             FROM   unnest(e.actors) WITH ORDINALITY a(id, i)
             JOIN   eg_person p USING (id)
             ORDER  BY a.i) AS act_names
     , benefactors
     , ARRAY(SELECT name
             FROM   unnest(e.benefactors) WITH ORDINALITY b(id, i)
             JOIN   eg_person USING (id)
             ORDER  BY b.i) AS ben_names
FROM   eg_assoc e;

LATERAL queries

For PostgreSQL 9.3+.

SELECT e.aid, e.actors, a.act_names, e.benefactors, b.ben_names
FROM   eg_assoc e
, LATERAL (
   SELECT ARRAY( SELECT name
                 FROM   generate_subscripts(e.actors, 1) i
                 JOIN   eg_person p ON p.id = e.actors[i]
                 ORDER  BY i)
   ) a(act_names)
, LATERAL (
   SELECT ARRAY( SELECT name
                 FROM   generate_subscripts(e.benefactors, 1) i
                 JOIN   eg_person p ON p.id = e.benefactors[i]
                 ORDER  BY i)
   ) b(ben_names);

db<>fiddle here with a couple of variants.
Old sqlfiddle

Subtle detail: If a person is not found, it's just dropped. Both of these queries generate an empty array ('{}') if no person is found for the whole array. Other query styles would return NULL. I added variants to the fiddle.

Correlated subqueries

For Postgres 8.4+ (where generate_subsrcipts() was introduced):

SELECT aid, actors
     , ARRAY(SELECT name
             FROM   generate_subscripts(e.actors, 1) i
             JOIN   eg_person p ON p.id = e.actors[i]
             ORDER  BY i) AS act_names
     , benefactors
     , ARRAY(SELECT name
             FROM   generate_subscripts(e.benefactors, 1) i
             JOIN   eg_person p ON p.id = e.benefactors[i]
             ORDER  BY i) AS ben_names
FROM   eg_assoc e;

May still perform best, even in Postgres 9.3.
The ARRAY constructor is faster than array_agg(). See:

Your failed query

The query provided by @a_horse seems to do the job, but it is unreliable, misleading, potentially incorrect and needlessly expensive.

  1. Proxy cross join because of two unrelated joins. A sneaky anti-pattern. See:

    Fixed superficially with DISTINCT in array_agg()to eliminates the generated duplicates, but that's really putting lipstick on a pig. It also eliminates duplicates in the original because its impossible to tell the difference at this point - which is potentially incorrect.

  2. The expression a_person.id = any(eg_assoc.actors) works, but eliminates duplicates from the result (happens two times in this query), which is wrong unless specified.

  3. Original order of array elements is not preserved. This is tricky in general. But it's aggravated in this query, because actors and benefactors are multiplied and made distinct again, which guarantees arbitrary order.

  4. No column aliases in the outer SELECT result in duplicate column names, which makes some clients fails (not working in the fiddle without aliases).

  5. min(actors) and min(benefactors) are useless. Normally one would just add the columns to GROUP BY instead of fake-aggregating them. But eg_assoc.aid is the PK column anyway (covering the whole table in GROUP BY), so that's not even necessary. Just actors, benefactors.

Aggregating the whole result is wasted time and effort to begin with. Use a smarter query that doesn't multiply the base rows, then you don't have to aggregate them back.

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • That's a lot to go over, and it's late. I see many of your points. Some of the concerns are not relevant in this particular case, but that does not mean they wouldn't be for others in a similar situation. e.g. ordering – BillRobertson42 Feb 19 '15 at 07:34
  • 1
    @Bill: Exactly, answers are for the general public. I am answering your actual question, not your (unknown) real situation. The good news: the "full-blown" version should be faster than the currently accepted answer either way, besides being conservative and correct. – Erwin Brandstetter Feb 20 '15 at 06:02
  • Have not circled back around to this yet. Nasty cold and lots of work at the moment. – BillRobertson42 Feb 20 '15 at 20:04
4

You need to use the = ANY() operator:

SELECT aid, actors, a_person.name, benefactors, b_person.name 
FROM eg_assoc
  LEFT JOIN eg_person a_person on a_person.id = any(eg_assoc.actors)
  LEFT JOIN eg_person b_person on b_person.id = any(eg_assoc.benefactors);

It would be really nice if I could get it to look like this though.

Just aggregate the values based on the aid:

SELECT aid, min(actors), array_agg(distinct a_person.name), min(benefactors), array_agg(distinct b_person.name)
FROM   eg_assoc
  LEFT JOIN eg_person a_person on a_person.id = any(eg_assoc.actors)
  LEFT JOIN eg_person b_person on b_person.id = any(eg_assoc.benefactors)
group by aid;
  • Thank you. Looks like the any operator does not work with GIN indexes (http://www.postgresql.org/docs/9.4/static/indexes-types.html) However && does, and appears to be the same thing. Also, never seen min used like that with a group by. That's awesome. – BillRobertson42 Feb 18 '15 at 18:12
  • You have posted countless excellent answers. But this one is not up to your standard. I added details to my answer. – Erwin Brandstetter Feb 19 '15 at 04:38