0

I'm trying to figure out if the query I'd like to do is at all doable or feasible in SQL or if I need to collect raw data and process it in my application.

My schema looks like this:

applications
================
id INT

application_steps
=================
id INT
application_id INT
step_id INT
activated_at DATE
completed_at DATE

steps
=====
id INT
step_type_id INT

Ideally, with this data in application_steps:

| id | application_id | step_id | activated_at | completed_at |
| 1  | 1              | 1       | 2013-01-01   | 2013-01-02   |
| 2  | 1              | 2       | 2013-01-02   | 2013-01-02   |
| 3  | 1              | 3       | 2013-01-02   | 2013-01-10   |
| 4  | 1              | 4       | 2013-01-10   | 2013-01-11   |
| 5  | 2              | 1       | 2013-02-02   | 2013-02-02   |
| 6  | 2              | 2       | 2013-02-02   | 2013-02-07   |
| 7  | 2              | 4       | 2013-02-09   | 2013-02-11   |

I want to get this result:

| application_id | step_1_days | step_2_days | step_3_days | step_4_days |
| 1              | 1           | 0           | 8           | 1           |
| 2              | 0           | 5           | NULL        | 2           |

Note that in reality there are many more steps and many more applications that I would be looking at.

As you can see, there is a has-many relation between applications and application_steps. It is also possible for a given step to not be in use for a particular application. I'd like to get the amount of time each step takes (using DATEDIFF(completed_at, activated_at)), all in one row (the column names don't matter). Is this at all possible?

Secondary question: To complicate things a bit further, I will also need a secondary query which joins application_steps with steps and only gets data for steps with a particular step_type_id. Assuming part one is possible, how can I extend it to filter efficiently?

NOTE: Efficiency is key here - this is for a yearly report, which equates to about 2500 applications with 70 different steps and 44,000 application_steps in production (not a lot of data, but potentially a lot when joins are factored in).

Daniel Vandersluis
  • 91,582
  • 23
  • 169
  • 153

1 Answers1

1

This should be a basic "pivoting" aggregation:

select id,
       max(case when step_id = 1 then datediff(completed_at, activated_at) end) as step_1_days,
       max(case when step_id = 2 then datediff(completed_at, activated_at) end) as step_2_days,
       max(case when step_id = 3 then datediff(completed_at, activated_at) end) as step_3_days,
       max(case when step_id = 4 then datediff(completed_at, activated_at) end) as step_4_days
from application_steps s
group by id;

You would have to repeat this for all 70 steps.

To do this only for a particular type of step:

select application_id,
       max(case when step_id = 1 then datediff(completed_at, activated_at) end) as step_1_days,
       max(case when step_id = 2 then datediff(completed_at, activated_at) end) as step_2_days,
       max(case when step_id = 3 then datediff(completed_at, activated_at) end) as step_3_days,
       max(case when step_id = 4 then datediff(completed_at, activated_at) end) as step_4_days
from application_steps s join
     steps
     on s.step_id = steps.id and
        steps.step_type_id = XXX
group by application_id;
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • should that `max` be `sum`? – Praveen Lobo Sep 04 '13 at 03:25
  • @Lobo . . . I don't know if it should be `max()` or `sum()`, and it may not make a difference. Which depends on the data and the business requirements. When there is only one value expected per pivot column, I usually use `max()`. – Gordon Linoff Sep 04 '13 at 11:37
  • @DanielVandersluis an aggregator is a must to avoid nulls; whether its `max()` or `sum()` depends on what @GordonLinoff mentioned. – Praveen Lobo Sep 04 '13 at 14:01