0

I'm creating a step funnel report using SQL.

It returns rows such as:

delivered_email,anonymous_id,opened_email,step1_delivered,step2_opened,step3_landing_page,step4_cta_clicked,steps_completed
email1@example.com,,,true,false,false,false,1
email2@example.com,id2,email2@example.com,true,true,true,true,4
email2@example.com,id3,email2@example.com,true,true,false,false,2

There are multiple entries for the same email address because these people have engaged in multiple sessions. However, in this case I am only interested in each person's session with the most steps completed. E.g. the actual result in the above case should have had 2 rows instead of 3 rows, where for email2@example.com only the case where steps_completed = 4 is returned:

delivered_email,anonymous_id,opened_email,step1_delivered,step2_opened,step3_landing_page,step4_cta_clicked,steps_completed
email1@example.com,,,true,false,false,false,1
email2@example.com,id2,email2@example.com,true,true,true,true,4

Normally one would do this by joining the result with each user's max(steps_completed), as described on Stackoverflow. However, in my case the steps_completed column is actually computed as part of another subquery. Therefore creating a join on it would require me to copy-paste the entire subquery, which would be impossible to maintain.

This is the query:

select
  *

from
(
  -- Counts for each sesssion how many steps were completed
  -- This can be used to only select the session with the most steps completed for each unique email address
  select
    *,
    if(step1_delivered, 1, 0) +
    if(step2_opened, 1, 0) +
    if(step3_landing_page, 1, 0) +
    if(step4_cta_clicked, 1, 0)
      as steps_completed

  from
  (
    -- Below subquery combines email addresses with associated anonymous_ids
    -- Note that a single email may have multiple entries here if they used multiple devices
    -- In the rest of the funnel we are interested only in the case grouped by email with the most steps completed
    select
      t_delivered.email as delivered_email,
      t_identifies.id as anonymous_id,
      t_opened.email as opened_email,
      t_delivered.email is not null as step1_delivered,
      coalesce(t_opened.email, t_identifies.id) is not null as step2_opened,
      t_landing_page.id is not null as step3_landing_page,
      t_cta_clicked.id is not null as step4_cta_clicked

    -- Step 1: Retrieve emails to which opener was sent
    from
    (
      select context_traits_email as email

      from drip.email_delivered

      where email_subject like '%you are invited%'

      group by email
    ) as t_delivered

    -- Retrieve the anonymous_id for each email, if set (i.e. if identified)
    -- Note that if we have identified a user we will assume they have opened the email
    left join
    (
      select
        email,
        anonymous_id as id

      from javascript.identifies

      group by email, anonymous_id
    ) as t_identifies

    on t_identifies.email = t_delivered.email

    -- Step 2: retrieve which email addresses opened the opener email
    left join
    (
      select context_traits_email as email
      from drip.email_opened
      group by email
    ) as t_opened

    on t_opened.email = t_delivered.email

    -- Step 3: landing page visited
    left join
    (
      select anonymous_id as id
      from javascript.pages
      where context_page_title = 'Homepage'
      group by anonymous_id
    ) as t_landing_page

    on t_landing_page.id = t_identifies.id

    -- Step 4: CTA clicked
    left join
    (
      select anonymous_id as id
      from javascript.dtc_file_selection_initiated
      group by anonymous_id
    ) as t_cta_clicked

    on t_cta_clicked.id = t_identifies.id
  )
)

How would I essentially group this result by delivered_email while the results (before grouping) are ordered by steps_completed (desc) without repeating my subqueries?

Tom
  • 8,536
  • 31
  • 133
  • 232
  • 1
    Can you make your subquery a view and then join to view for what is needed? – alexherm Jul 03 '19 at 18:07
  • @alexherm that worked, though it requires me to maintain a separate view -- do you think this is the only way? – Tom Jul 03 '19 at 18:32
  • I'm sure there is another way. But if this works well then go with it. By maintain do you mean set it up once or would you need to update it regularly? – alexherm Jul 03 '19 at 18:44
  • @alexherm I need to update it regularly.. there are some subqueries in there that query for segments of users etc that I need to change depending on the segment I am interested in – Tom Jul 03 '19 at 19:08
  • [mre] please. PS What is your question? What does "group this result by delivered_email while the results (before grouping) are ordered by steps_completed" mean? Tables have no order, so ordering before group by without limit/top has no effect, but that's clearly not what you want, so what do you want? Use enough words & references to parts of examples. Also "essentially" when not introducing or summarizing full details just means "unclearly". PS It's also not clear what this post & code has to do with "another subquery" & avioding "a join on it". PS Clarify via edits, not comments. – philipxy Jul 03 '19 at 21:25

1 Answers1

0

You should try to use CTE (aka "with clause") and numbering window functions

with

t_delivered as (
    select distinct
        context_traits_email as email
    from
        drip.email_delivered
    where
        email_subject like '%you are invited%'
),

t_identifies as (
    -- Retrieve the anonymous_id for each email, if set (i.e. if identified)
    -- Note that if we have identified a user we will assume they have opened the email
    select distinct
        email,
        anonymous_id as id
    from
        `javascript.identifies`
),

t_opened as (
    -- Step 2: retrieve which email addresses opened the opener email
    select distinct
        context_traits_email as email
    from
        `drip.email_opened`
),

t_landing_page as (
    -- Step 3: landing page visited
    select distinct
        anonymous_id as id
    from
        `javascript.pages`
    where
        context_page_title = 'Homepage'
),

t_cta_clicked as (
    -- Step 4: CTA clicked
    select distinct
        anonymous_id as id
    from
        `javascript.dtc_file_selection_initiated`
),

total_data as (
    -- Below subquery combines email addresses with associated anonymous_ids
    -- Note that a single email may have multiple entries here if they used multiple devices
    -- In the rest of the funnel we are interested only in the case grouped by email with the most steps completed
    select
        td.email as delivered_email,
        ti.id as anonymous_id,
        to.email as opened_email,
        td.email is not null as step1_delivered,
        coalesce(ti.id, to.email) is not null as step2_opened,
        tlp.id is not null as step3_landing_page,
        tcc.id is not null as step4_cta_clicked
    from
        t_delivered as td
        left join t_identifies as ti on td.email = ti.email
        left join t_opened as to on td.email = to.email
        left join t_landing_page as tlp on ti.id = tlp.id
        left join t_cta_clicked as tcc on ti.id = tcc.id
)

select
    *
from
    -- Counts for each session how many steps were completed
    -- This can be used to only select the session with the most steps completed for each unique email address
    (   select
            *,
            row_number() over(  partition by
                                    delivered_email
                                order by  -- prioritize columns here
                                    steps_completed desc,
                                    step4_cta_clicked desc,
                                    step3_landing_page desc,
                                    step2_opened desc,
                                    step1_delivered desc,
                                    anonymous_id) as rn
        from
            (   select
                    *,
                    if(step1_delivered, 1, 0)
                    + if(step2_opened, 1, 0)
                    + if(step3_landing_page, 1, 0)
                    + if(step4_cta_clicked, 1, 0) as steps_completed
                from
                    total_data
                )
        )
where
    rn = 1
Y.K.
  • 682
  • 4
  • 10