I have a database design (in PostgreSQL 9.6) where company data is stored in one table. Each company can have one or more contact persons, whose details are laid out in another table. The (simplified) schema is something like this:
DROP TABLE IF EXISTS test_company;
CREATE TABLE test_company (id integer, company_name text, contact_person integer[]);
DROP TABLE IF EXISTS test_contact_person;
CREATE TABLE test_contact_person (id integer, person_name text);
Now consider data like this:
INSERT INTO test_company (id, company_name, contact_person) VALUES (1, 'Foo Ldt.', '{1,2}');
INSERT INTO test_company (id, company_name, contact_person) VALUES (2, 'Foo Sub Inc.', '{1,2}');
INSERT INTO test_company (id, company_name, contact_person) VALUES (3, 'Foo Sub Sub Inc.', '{1}');
INSERT INTO test_company (id, company_name, contact_person) VALUES (4, 'Bar Inc.', '{3,4}');
INSERT INTO test_company (id, company_name, contact_person) VALUES (5, 'Foo-Bar Joint-Venture', '{2,3,4}');
INSERT INTO test_contact_person(id, person_name) VALUES (1,'John');
INSERT INTO test_contact_person(id, person_name) VALUES (2,'Maria');
INSERT INTO test_contact_person(id, person_name) VALUES (3,'Bill');
INSERT INTO test_contact_person(id, person_name) VALUES (4,'Jane');
As you can see, one person can be the contact for multiple companies, and even the "pairs" (like '{1,2}') can be the same.
Now, the requirements when querying companies are:
- One row per company
- The details of the contact persons should all be returned in a single column, as a JSON array like [{"id":1,"person_name":"John"}]
Right now, I'm solving that problem with a subquery like so:
SELECT
    id,
    company_name,
    (
        SELECT json_agg(my_subquery)
        FROM (
            SELECT id, person_name
            FROM test_contact_person
            WHERE id = ANY (test_company.contact_person)
        ) AS my_subquery
    ) AS contact_person_expanded
FROM test_company;
That gives me the expected result. However (as always), performance is not satisfactory. BTW: there are currently no indexes on either table. I'm wondering now:
- Would using JOINs make the query faster? If yes: How exactly would I make a JOIN return the JSON array here?
- Would using indexes improve the performance? If yes: What kind of index on which column?
Update
Just for reference, I'd like to point out that the solution suggested by Radim Bača does indeed improve performance.
First, I inserted a lot more data with an ugly plv8 loop:
DROP TABLE IF EXISTS test_company;
CREATE TABLE test_company (id integer, company_name text, contact_person integer[]);
DROP TABLE IF EXISTS test_contact_person;
CREATE TABLE test_contact_person (id integer, person_name text);
DO $$
    // generate 19,999 contact persons
    for (var i = 1; i < 20000; i++) {
        plv8.execute('INSERT INTO test_contact_person (id, person_name) VALUES ($1, $2)', [i, 'SomePerson' + i]);
    }
    // generate 9,999 companies, each referencing two valid contact person ids
    for (var i = 1; i < 10000; i++) {
        plv8.execute('INSERT INTO test_company (id, company_name, contact_person) VALUES ($1, $2, $3)', [i, 'SomeCompany' + i, [i, (20000 - i)]]);
    }
$$ LANGUAGE plv8;
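BTW: if plv8 isn't available, roughly the same test data can be generated in plain SQL with generate_series, something like:

-- same row counts as the plv8 loops above
INSERT INTO test_contact_person (id, person_name)
SELECT i, 'SomePerson' || i
FROM generate_series(1, 19999) AS i;

INSERT INTO test_company (id, company_name, contact_person)
SELECT i, 'SomeCompany' || i, ARRAY[i, 20000 - i]
FROM generate_series(1, 9999) AS i;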
I then tried my version of the query again:
SELECT
    id,
    company_name,
    (
        SELECT json_agg(my_subquery)
        FROM (
            SELECT id, person_name
            FROM test_contact_person
            WHERE id = ANY (test_company.contact_person)
        ) AS my_subquery
    ) AS contact_person_expanded
FROM test_company;
That gives me an execution time of about 23 seconds (always measured on my local machine in pgAdmin 3), compared to
SELECT
    comp.id,
    comp.company_name,
    json_agg(json_build_object('id', pers.id, 'person_name', pers.person_name)) AS contact_person_expanded
FROM test_company comp
JOIN test_contact_person pers ON comp.contact_person @> ARRAY[pers.id]
GROUP BY comp.id, comp.company_name;
That one takes about 47 seconds - still without an index.
Finally, I added an index:
DROP INDEX IF EXISTS idx_testcompany_contactperson;
CREATE INDEX idx_testcompany_contactperson ON test_company USING GIN (contact_person);
The execution time of the subquery version doesn't change. When using the JOIN, however, the effect is dramatic: 1.1 seconds!
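To confirm that the planner actually picks up the GIN index for the join, the plan can be inspected with a standard EXPLAIN ANALYZE (with the index in place, the test_company side should show a bitmap index scan on idx_testcompany_contactperson):

EXPLAIN ANALYZE
SELECT
    comp.id,
    comp.company_name,
    json_agg(json_build_object('id', pers.id, 'person_name', pers.person_name)) AS contact_person_expanded
FROM test_company comp
JOIN test_contact_person pers ON comp.contact_person @> ARRAY[pers.id]
GROUP BY comp.id, comp.company_name;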
BTW: I once heard that, in the subquery, test_company.contact_person @> ARRAY[id] is faster than id = ANY (test_company.contact_person). As far as I tested, that is not true: in my case, the latter version returned all rows in 23 seconds, while the former took 46 seconds.
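For reference, the containment variant I tested looks like this (only the WHERE clause differs from the subquery version above; id inside ARRAY[id] refers to test_contact_person.id):

SELECT
    id,
    company_name,
    (
        SELECT json_agg(my_subquery)
        FROM (
            SELECT id, person_name
            FROM test_contact_person
            WHERE test_company.contact_person @> ARRAY[id]
        ) AS my_subquery
    ) AS contact_person_expanded
FROM test_company;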