SQL performance issues with inner selects in PostgreSQL for tabulated report

Question

Using a PostgreSQL database:

I have a survey application where users enter activities and answer questions about their activities. The survey itself is called RECALLS_T, entered events are EVENTS_T, and answers are ANSWERS_T. Answers are for provided activity questions, stored in ACTIVITY_QUESTIONS_T, mapped by Lookup (LOOKUP_T).

Then I need to run an event-based report where every row is an event from EVENTS_T for every recall (all events combined for all recalls). However, some columns in that report need to indicate a value for certain answers, otherwise those cells are NULL. So this is a tabulated report.

Example (easy flat stuff first, then the complicated tabulated stuff):

RecallID | RecallDate | Event |..| WalkAlone | WalkWithPartner |..| ExerciseAtGym
256      | 10-01-19   | Exrcs |..| NULL      | NULL            |..| yes
256      | 10-01-19   | Walk  |..| yes       | NULL            |..| NULL
256      | 10-01-19   | Eat   |..| NULL      | NULL            |..| NULL
257      | 10-01-19   | Exrcs |..| NULL      | NULL            |..| yes

My SQL has inner selects for the tabulated answer-based columns and looks like this:

select 
-- Easy flat stuff first
r.id as recallid, r.recall_date as recalldate, ... ,

-- Example of Tabulated Columns:
(select l.description from answers_t ans, activity_questions_t aq, lookup_t l 
where l.id=aq.answer_choice_id and aq.question_id=13 
and aq.id=ans.activity_question_id and aq.activity_id=27 and ans.event_id=e.id) 
     as transportationotherintensity,
(select l.description from answers_t ans, activity_questions_t aq, lookup_t l
where l.id=66 and l.id=aq.answer_choice_id and aq.question_id=14
and aq.id=ans.activity_question_id and ans.event_id=e.id) 
     as commutework,
(select l.description from answers_t ans, activity_questions_t aq, lookup_t l
where l.id=67 and l.id=aq.answer_choice_id and aq.question_id=14 and aq.id=ans.activity_question_id and ans.event_id=e.id) 
     as commuteschool,
(select l.description from answers_t ans, activity_questions_t aq, lookup_t l
where l.id=95 and l.id=aq.answer_choice_id and aq.question_id=14 and aq.id=ans.activity_question_id and ans.event_id=e.id) 
     as dropoffpickup,

The SQL works and the report gets rendered, but the performance is bad. I verified that it's proportionally bad: there's no magic bullet for a specific item that would fix it. Every inner select contributes to the bad performance. The result set of 1,000 rows takes 15 seconds, but should take no more than 2-3 seconds.

Note that these indexes already exist:

ANSWERS_T: on ACTIVITY_QUESTION_ID, EVENT_ID
EVENTS_T: on RECALL_ID
ACTIVITY_QUESTIONS_T: on ACTIVITY_ID, QUESTION_ID, ANSWER_CHOICE_ID

Is there something I'm doing wrong with these inner selects?

score 6 · Accepted Answer · edited Sep 01 '20 at 12:52

6

To summarize the questions, you want to use conditional aggregation. In PostgreSQL, you can use:

select ans.event_id,
       max(l.description) filter (where aq.question_id = 13 and aq.activity_id = 27) as transportationotherintensity
       max(l.description) filter (where l.id = 66 and aq.question_id = 14 and aq.activity_id = 67) as commutework,
       . . .
from activity_questions_t aq join
     lookup_t l
     on l.id = aq.answer_choice_id join
     answers_t ans
     on aq.id = ans.activity_question_id
group by ans.event_id

edited Sep 01 '20 at 12:52

Peter Mortensen

30,738
21
105
131

answered Oct 24 '19 at 00:21

Gordon Linoff

1,242,037
58
646
786

Let me ask if you, if I rewrite the query with 1-line single Select's from Java for each column that accumulate data in a collection, will that be faster? I can replace these 40 sub-selects with 40 one-line DAO fetches from the Java Service class. Would there be gains from flat fetches? Right now they're nested in a big select with an Order By and all the other dependent clauses. – gene b. Oct 25 '19 at 00:23
@geneb. . . In general, I think that running a single query in the database is going to be the fastest solution. However, you can compare on your data and your system to see which works best for you. – Gordon Linoff Oct 25 '19 at 03:12
I researched this "FILTER" answer further, and this seems like a hack/workaround. "FILTER" is meant for aggregate functions with a condition, such as: "SUM (CASE WHEN ... THEN .. ELSE END". Here we're using a "fake" aggregate function to satisfy a FILTER, which is "MAX(l.description)", but we're not really interested in a max of any sort. We just want the column returned. Am I correct that this isn't a straightforward improvement, but rather a hack? – gene b. Jan 21 '20 at 16:17
@geneb. . . . This is a perfectly valid way to express the logic in SQL. – Gordon Linoff Jan 21 '20 at 22:51
1

Thanks. You're right, my report went from 17 sec to 1 sec with this FILTER approach. Much appreciated. – gene b. Jan 23 '20 at 01:39

SQL performance issues with inner selects in PostgreSQL for tabulated report

1 Answers1

Linked