I'm trying to figure out if the query I'd like to do is at all doable or feasible in SQL or if I need to collect raw data and process it in my application.
My schema looks like this:
applications
================
id INT
application_steps
=================
id INT
application_id INT
step_id INT
activated_at DATE
completed_at DATE
steps
=====
id INT
step_type_id INT
Ideally, with this data in application_steps
:
| id | application_id | step_id | activated_at | completed_at |
| 1 | 1 | 1 | 2013-01-01 | 2013-01-02 |
| 2 | 1 | 2 | 2013-01-02 | 2013-01-02 |
| 3 | 1 | 3 | 2013-01-02 | 2013-01-10 |
| 4 | 1 | 4 | 2013-01-10 | 2013-01-11 |
| 5 | 2 | 1 | 2013-02-02 | 2013-02-02 |
| 6 | 2 | 2 | 2013-02-02 | 2013-02-07 |
| 7 | 2 | 4 | 2013-02-09 | 2013-02-11 |
I want to get this result:
| application_id | step_1_days | step_2_days | step_3_days | step_4_days |
| 1 | 1 | 0 | 8 | 1 |
| 2 | 0 | 5 | NULL | 2 |
Note that in reality there are many more steps and many more applications that I would be looking at.
As you can see, there is a has-many relation between applications
and application_steps
. It is also possible for a given step to not be in use for a particular application. I'd like to get the amount of time each step takes (using DATEDIFF(completed_at, activated_at)
), all in one row (the column names don't matter). Is this at all possible?
Secondary question: To complicate things a bit further, I will also need a secondary query which joins application_steps
with steps
and only gets data for steps with a particular step_type_id
. Assuming part one is possible, how can I extend it to filter efficiently?
NOTE: Efficiency is key here - this is for a yearly report, which equates to about 2500 applications
with 70 different steps
and 44,000 application_steps
in production (not a lot of data, but potentially a lot when joins are factored in).