2

Using a PostgreSQL database:

I have a survey application where users enter activities and answer questions about their activities. The survey itself is called RECALLS_T, entered events are EVENTS_T, and answers are ANSWERS_T. Answers are for provided activity questions, stored in ACTIVITY_QUESTIONS_T, mapped by Lookup (LOOKUP_T).

Then I need to run an event-based report where every row is an event from EVENTS_T for every recall (all events combined for all recalls). However, some columns in that report need to indicate a value for certain answers, otherwise those cells are NULL. So this is a tabulated report.

Example (easy flat stuff first, then the complicated tabulated stuff):

RecallID | RecallDate | Event |..| WalkAlone | WalkWithPartner |..| ExerciseAtGym
256      | 10-01-19   | Exrcs |..| NULL      | NULL            |..| yes
256      | 10-01-19   | Walk  |..| yes       | NULL            |..| NULL
256      | 10-01-19   | Eat   |..| NULL      | NULL            |..| NULL
257      | 10-01-19   | Exrcs |..| NULL      | NULL            |..| yes

My SQL has inner selects for the tabulated answer-based columns and looks like this:

select 
-- Easy flat stuff first
r.id as recallid, r.recall_date as recalldate, ... ,

-- Example of Tabulated Columns:
(select l.description from answers_t ans, activity_questions_t aq, lookup_t l 
where l.id=aq.answer_choice_id and aq.question_id=13 
and aq.id=ans.activity_question_id and aq.activity_id=27 and ans.event_id=e.id) 
     as transportationotherintensity,
(select l.description from answers_t ans, activity_questions_t aq, lookup_t l
where l.id=66 and l.id=aq.answer_choice_id and aq.question_id=14
and aq.id=ans.activity_question_id and ans.event_id=e.id) 
     as commutework,
(select l.description from answers_t ans, activity_questions_t aq, lookup_t l
where l.id=67 and l.id=aq.answer_choice_id and aq.question_id=14 and aq.id=ans.activity_question_id and ans.event_id=e.id) 
     as commuteschool,
(select l.description from answers_t ans, activity_questions_t aq, lookup_t l
where l.id=95 and l.id=aq.answer_choice_id and aq.question_id=14 and aq.id=ans.activity_question_id and ans.event_id=e.id) 
     as dropoffpickup,

The SQL works and the report gets rendered, but the performance is bad. I verified that it's proportionally bad: there's no magic bullet for a specific item that would fix it. Every inner select contributes to the bad performance. The result set of 1,000 rows takes 15 seconds, but should take no more than 2-3 seconds.

Note that these indexes already exist:

  • ANSWERS_T: on ACTIVITY_QUESTION_ID, EVENT_ID
  • EVENTS_T: on RECALL_ID
  • ACTIVITY_QUESTIONS_T: on ACTIVITY_ID, QUESTION_ID, ANSWER_CHOICE_ID

Is there something I'm doing wrong with these inner selects?

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
gene b.
  • 10,512
  • 21
  • 115
  • 227

1 Answers1

6

To summarize the questions, you want to use conditional aggregation. In PostgreSQL, you can use:

select ans.event_id,
       max(l.description) filter (where aq.question_id = 13 and aq.activity_id = 27) as transportationotherintensity
       max(l.description) filter (where l.id = 66 and aq.question_id = 14 and aq.activity_id = 67) as commutework,
       . . .
from activity_questions_t aq join
     lookup_t l
     on l.id = aq.answer_choice_id join
     answers_t ans
     on aq.id = ans.activity_question_id
group by ans.event_id
Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • Let me ask if you, if I rewrite the query with 1-line single Select's from Java for each column that accumulate data in a collection, will that be faster? I can replace these 40 sub-selects with 40 one-line DAO fetches from the Java Service class. Would there be gains from flat fetches? Right now they're nested in a big select with an Order By and all the other dependent clauses. – gene b. Oct 25 '19 at 00:23
  • @geneb. . . In general, I think that running a single query in the database is going to be the fastest solution. However, you can compare on your data and your system to see which works best for you. – Gordon Linoff Oct 25 '19 at 03:12
  • I researched this "FILTER" answer further, and this seems like a hack/workaround. "FILTER" is meant for aggregate functions with a condition, such as: "SUM (CASE WHEN ... THEN .. ELSE END". Here we're using a "fake" aggregate function to satisfy a FILTER, which is "MAX(l.description)", but we're not really interested in a max of any sort. We just want the column returned. Am I correct that this isn't a straightforward improvement, but rather a hack? – gene b. Jan 21 '20 at 16:17
  • @geneb. . . . This is a perfectly valid way to express the logic in SQL. – Gordon Linoff Jan 21 '20 at 22:51
  • 1
    Thanks. You're right, my report went from 17 sec to 1 sec with this FILTER approach. Much appreciated. – gene b. Jan 23 '20 at 01:39