
I have a database design (in PostgreSQL 9.6) where one table stores company data. Each company can have one or more contact persons, whose details are laid out in another table. The (simplified) schema is something like this:

DROP TABLE IF EXISTS test_company;
CREATE TABLE test_company (id integer, company_name text, contact_person integer[]);

DROP TABLE IF EXISTS test_contact_person;
CREATE TABLE test_contact_person (id integer, person_name text);

Now consider data like this:

INSERT INTO test_company (id, company_name, contact_person) VALUES (1, 'Foo Ldt.', '{1,2}');
INSERT INTO test_company (id, company_name, contact_person) VALUES (2, 'Foo Sub Inc.', '{1,2}');
INSERT INTO test_company (id, company_name, contact_person) VALUES (3, 'Foo Sub Sub Inc.', '{1}');
INSERT INTO test_company (id, company_name, contact_person) VALUES (4, 'Bar Inc.', '{3,4}');
INSERT INTO test_company (id, company_name, contact_person) VALUES (5, 'Foo-Bar Joint-Venture', '{2,3,4}');

INSERT INTO test_contact_person(id, person_name) VALUES (1,'John');
INSERT INTO test_contact_person(id, person_name) VALUES (2,'Maria');
INSERT INTO test_contact_person(id, person_name) VALUES (3,'Bill');
INSERT INTO test_contact_person(id, person_name) VALUES (4,'Jane');

As you can see, one person can be the contact for multiple companies, and even the "pairs" (like '{1,2}') can occur more than once.

The requirements when querying companies are:

  • One row per company
  • The details of the contact persons should all be returned in one column, as a JSON array like [{"id":1,"person_name":"John"}]
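
For the sample data above, the value of that column for 'Foo Ldt.' (contact persons {1,2}) should therefore be:

[{"id":1,"person_name":"John"},{"id":2,"person_name":"Maria"}]

(json_agg may add whitespace or newlines between the elements.)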

Right now, I'm solving that problem with a subquery like so:

SELECT
  id,
  company_name,
  (
    SELECT json_agg(my_subquery)
    FROM (
      SELECT id, person_name
      FROM test_contact_person
      WHERE id = ANY (test_company.contact_person)
    ) AS my_subquery
  ) AS contact_person_expanded
FROM test_company;

That gives me the expected result. However (as always), the performance is not satisfactory. BTW: there are currently no indexes on either table. I'm wondering now:

  • Would using JOINs make the query faster? If yes: How exactly would I make a JOIN return the JSON array here?
  • Would using indexes improve the performance? If yes: What kind of index on which column?

Update

Just for reference, I'd like to point out that the solution suggested by Radim Bača does indeed improve performance.

First, I inserted a lot more data with an ugly plv8 loop:

DROP TABLE IF EXISTS test_company;
CREATE TABLE test_company (id integer, company_name text, contact_person integer[]);

DROP TABLE IF EXISTS test_contact_person;
CREATE TABLE test_contact_person (id integer, person_name text);

DO $$
  for (var i = 1; i < 20000; i++) {
    plv8.execute('INSERT INTO test_contact_person(id, person_name) VALUES ($1,$2)', [i, 'SomePerson' + i]);
  }

  for (var i = 1; i < 10000; i++) {
    plv8.execute('INSERT INTO test_company (id, company_name, contact_person) VALUES ($1,$2,$3)', [i, 'SomeCompany' + i, [i, (20 - i)]]);
  }
$$ LANGUAGE plv8;
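
(If plv8 is not available, an equivalent plain-SQL data load using generate_series would be:)

-- same 19,999 persons and 9,999 companies as the plv8 loop above
INSERT INTO test_contact_person (id, person_name)
SELECT i, 'SomePerson' || i
FROM generate_series(1, 19999) AS i;

INSERT INTO test_company (id, company_name, contact_person)
SELECT i, 'SomeCompany' || i, ARRAY[i, 20 - i]
FROM generate_series(1, 9999) AS i;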

I then tried my version of the query again:

SELECT
  id,
  company_name,
  (
    SELECT json_agg(my_subquery)
    FROM (
      SELECT id, person_name
      FROM test_contact_person
      WHERE id = ANY (test_company.contact_person)
    ) AS my_subquery
  ) AS contact_person_expanded
FROM test_company;

That gives me an execution time of about 23 seconds (always measured on my local machine in pgAdmin 3), compared to

SELECT
  comp.id,
  comp.company_name,
  json_agg(json_build_object('id', pers.id, 'person_name', pers.person_name)) AS contact_person_expanded
FROM test_company comp
JOIN test_contact_person pers ON comp.contact_person @> ARRAY[pers.id]
GROUP BY comp.id, comp.company_name

which takes about 47 seconds without an index.

Finally, I added an index:

DROP INDEX IF EXISTS idx_testcompany_contactperson;
CREATE INDEX idx_testcompany_contactperson ON test_company USING GIN (contact_person);

The execution time of the version with the subquery doesn't change; with the JOIN, however, the effect is dramatic: 1.1 seconds!
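
One way to confirm that the GIN index is actually used is to inspect the plan; for the JOIN version it should show a Bitmap Index Scan on idx_testcompany_contactperson:

EXPLAIN (ANALYZE, BUFFERS)
SELECT
  comp.id,
  comp.company_name,
  json_agg(json_build_object('id', pers.id, 'person_name', pers.person_name)) AS contact_person_expanded
FROM test_company comp
JOIN test_contact_person pers ON comp.contact_person @> ARRAY[pers.id]
GROUP BY comp.id, comp.company_name;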

BTW: I once heard that, in the subquery, test_company.contact_person @> ARRAY[id] is faster than id = ANY(test_company.contact_person). As far as I can tell from my tests, that is not true: the latter version returned all rows in 23 seconds, while the former took 46 seconds.
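
(For completeness, that @> variant is the same correlated subquery with only the WHERE clause swapped:)

SELECT
  id,
  company_name,
  (
    SELECT json_agg(my_subquery)
    FROM (
      SELECT id, person_name
      FROM test_contact_person
      WHERE test_company.contact_person @> ARRAY[id]
    ) AS my_subquery
  ) AS contact_person_expanded
FROM test_company;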


1 Answer


I would use the common relational approach for M:N cardinality:

CREATE TABLE company (cid integer primary key, company_name text);
CREATE TABLE contact_person (pid integer primary key, person_name text);
CREATE TABLE contact(
    cid integer references company,
    pid integer references contact_person,
    primary key(cid, pid)
);

For the first company, you simply add the following rows:

INSERT INTO contact VALUES (1, 1);
INSERT INTO contact VALUES (1, 2);
-- and so on
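
If you need to migrate existing data from the array-based schema, the whole mapping can be filled in one pass with unnest. A sketch, assuming the test_company and test_contact_person tables from the question (and that every array element references an existing person):

-- copy the base rows into the normalized tables
INSERT INTO company (cid, company_name)
SELECT id, company_name FROM test_company;

INSERT INTO contact_person (pid, person_name)
SELECT id, person_name FROM test_contact_person;

-- explode each contact_person array into one (cid, pid) row per element
INSERT INTO contact (cid, pid)
SELECT tc.id, p.pid
FROM test_company tc
CROSS JOIN LATERAL unnest(tc.contact_person) AS p(pid);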

If you then need the companies together with their contacts, you simply use the following JOIN and JSON aggregation:

SELECT c.cid,
       c.company_name,
       json_agg(json_build_object('id', cp.pid, 'person_name', cp.person_name))
FROM company c
JOIN contact ct ON c.cid = ct.cid
JOIN contact_person cp ON cp.pid = ct.pid
GROUP BY c.cid, c.company_name;

demo

Indexes are created automatically for primary keys, so the performance should be OK. The question is: do you really want all companies and all their contacts? No filter at all?
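
For example, if you typically look up a single company, a filtered variant (the WHERE clause here is hypothetical) is served cheaply by the primary-key indexes:

SELECT c.cid,
       c.company_name,
       json_agg(json_build_object('id', cp.pid, 'person_name', cp.person_name))
FROM company c
JOIN contact ct ON c.cid = ct.cid
JOIN contact_person cp ON cp.pid = ct.pid
WHERE c.cid = 1  -- hypothetical filter
GROUP BY c.cid, c.company_name;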

EDIT: Based on your comments, I would at least rewrite your query using a JOIN instead of a correlated subquery. It may help the optimizer find a better plan.

SELECT
  comp.id,
  comp.company_name,
  json_agg(json_build_object('id', pers.id, 'person_name', pers.person_name))
FROM test_company comp
JOIN test_contact_person pers ON comp.contact_person @> ARRAY[pers.id]
GROUP BY comp.id, comp.company_name

This notation should allow PostgreSQL to use a GIN index like this:

CREATE INDEX idx_testcompany_contactperson ON test_company USING GIN (contact_person);
  • Does not meet my requirement. I need one row per company with an array of contact persons. Filtering is a different problem, which is not directly related here, I think. – cis Nov 14 '17 at 08:18
  • @cis updated, I believe filters are closely related to the performance and indexes of the whole solution – Radim Bača Nov 14 '17 at 08:27
  • Thanks! That gives me at least some hints. However, re-designing dozens of tables and columns is out of the question, I'm afraid. So I guess I'm stuck with my database design, and the problem is just how to optimize the queries in those cases. Of course, filters are related to performance and indexes, but I don't see the direct connection to the problem of "subquery vs. JOIN for JSON arrays" right now. – cis Nov 14 '17 at 10:04
  • @cis ok, I have at least added a different query corresponding to your current schema. – Radim Bača Nov 14 '17 at 10:50