Is there a way to optimise an array of subquery in a SQL select?

Question

I currently have two tables

question
--------
id
title, character varying

answer
--------
id
question_id
votes, integer

I use the following query to return me a list of questions with its corresponding array of votes:

SELECT question.id,
    question.title,
    ARRAY(SELECT votes
          FROM answer
          WHERE answer.question_id = question.id)
FROM question
ORDER BY question.id

The output looks like:

id  | title    | ?column?                       
----+----------+-----------------------------------------------------
100 | How to   | {5,2,7}
101 | Where is | {0}
102 | What is  | {1}

The above query can take close to 50s to run with hundred of thousands of questions where each question can have at least 5 answers. Is there a way to optimise the above?

Added postgresql tag b/c that looks like postgres-specific syntax. Please feel free remove tag if that's not a correct assessment. — mechanical_meat, Mar 05 '12 at 20:34
What is the type of the `votes` column of table `answer`? Is it an array, or a single integer value? I assume it's a single `int` value. Your query doesn't work if it's an array. — Joey Adams, Mar 05 '12 at 20:49

score 2 · Answer 1 · edited May 23 '17 at 10:24

You should use a join:

SELECT question.id, question.title, answer.votes
FROM question
JOIN answer ON answer.question_id == question.id
ORDER BY question.id

If you want the output column to contain a concatenated list of all "votes" associated with a question, and you are on Postgres, check out this question: How to concatenate strings of a string field in a PostgreSQL 'group by' query?

score 0 · Answer 2 · answered Mar 05 '12 at 21:02

If you want the query to produce one row per question, with votes gathered into an array, you can use a join, with array_agg:

SELECT question.id,
    question.title,
    array_agg(answer.votes) as answer_votes
FROM question
JOIN answer ON answer.question_id = question.id
GROUP BY question.id, question.title
ORDER BY question.id

score 0 · Accepted Answer · edited May 23 '17 at 11:55

I recommend creating an index on your answer table, and using your original query.

CREATE INDEX answer_question_id_idx ON answer(question_id);

Without this index, it will have to do a sequential scan of the entire table to find rows with a matching question_id. It will have to do that for every single question.

Alternatively, consider using a join, as arc suggested. I'm not an expert in the matter, but I think Postgres will use a hash join rather than multiple sequential scans, making the query faster. If you want to retain the id/title/array format, use array_agg:

SELECT question.id, question.title, array_agg(answer.votes)
  FROM question
  LEFT JOIN answer ON answer.question_id = question.id
 GROUP BY question.id, question.title
 ORDER BY question.id;

However, there's a caveat. If a question has no answers, you'll get a weird-looking result:

 id |       title       | array_agg 
----+-------------------+-----------
  1 | How do I do this? | {3,5}
  2 | How do I do that? | {NULL}
(2 rows)

This is because of the LEFT JOIN, which creates a NULL value when no rows from the joined table are available. With INNER JOIN, the second row won't appear at all.

That's why I recommend using your original query. It produces the expected result:

 id |       title       | ?column? 
----+-------------------+----------
  1 | How do I do this? | {3,5}
  2 | How do I do that? | {}

I already have this index in place and it still takes a while. I don't see a lot of improvements with the joining method after using `EXPLAIN`. — Thierry Lam, Mar 05 '12 at 21:15

Is there a way to optimise an array of subquery in a SQL select?

3 Answers3