I'm creating a step funnel report using SQL.
It returns rows such as:
delivered_email,anonymous_id,opened_email,step1_delivered,step2_opened,step3_landing_page,step4_cta_clicked,steps_completed
email1@example.com,,,true,false,false,false,1
email2@example.com,id2,email2@example.com,true,true,true,true,4
email2@example.com,id3,email2@example.com,true,true,false,false,2
There are multiple entries for the same email address because these people have engaged in multiple sessions. However, in this case I am only interested in each person's session with the most steps completed. E.g. the actual result in the above case should have had 2 rows instead of 3 rows, where for email2@example.com only the case where steps_completed = 4 is returned:
delivered_email,anonymous_id,opened_email,step1_delivered,step2_opened,step3_landing_page,step4_cta_clicked,steps_completed
email1@example.com,,,true,false,false,false,1
email2@example.com,id2,email2@example.com,true,true,true,true,4
Normally one would do this by joining the result with each user's max(steps_completed)
, as described on Stackoverflow. However, in my case the steps_completed
column is actually computed as part of another subquery. Therefore creating a join on it would require me to copy-paste the entire subquery, which would be impossible to maintain.
This is the query:
select
*
from
(
-- Counts for each sesssion how many steps were completed
-- This can be used to only select the session with the most steps completed for each unique email address
select
*,
if(step1_delivered, 1, 0) +
if(step2_opened, 1, 0) +
if(step3_landing_page, 1, 0) +
if(step4_cta_clicked, 1, 0)
as steps_completed
from
(
-- Below subquery combines email addresses with associated anonymous_ids
-- Note that a single email may have multiple entries here if they used multiple devices
-- In the rest of the funnel we are interested only in the case grouped by email with the most steps completed
select
t_delivered.email as delivered_email,
t_identifies.id as anonymous_id,
t_opened.email as opened_email,
t_delivered.email is not null as step1_delivered,
coalesce(t_opened.email, t_identifies.id) is not null as step2_opened,
t_landing_page.id is not null as step3_landing_page,
t_cta_clicked.id is not null as step4_cta_clicked
-- Step 1: Retrieve emails to which opener was sent
from
(
select context_traits_email as email
from drip.email_delivered
where email_subject like '%you are invited%'
group by email
) as t_delivered
-- Retrieve the anonymous_id for each email, if set (i.e. if identified)
-- Note that if we have identified a user we will assume they have opened the email
left join
(
select
email,
anonymous_id as id
from javascript.identifies
group by email, anonymous_id
) as t_identifies
on t_identifies.email = t_delivered.email
-- Step 2: retrieve which email addresses opened the opener email
left join
(
select context_traits_email as email
from drip.email_opened
group by email
) as t_opened
on t_opened.email = t_delivered.email
-- Step 3: landing page visited
left join
(
select anonymous_id as id
from javascript.pages
where context_page_title = 'Homepage'
group by anonymous_id
) as t_landing_page
on t_landing_page.id = t_identifies.id
-- Step 4: CTA clicked
left join
(
select anonymous_id as id
from javascript.dtc_file_selection_initiated
group by anonymous_id
) as t_cta_clicked
on t_cta_clicked.id = t_identifies.id
)
)
How would I essentially group this result by delivered_email
while the results (before grouping) are ordered by steps_completed
(desc) without repeating my subqueries?