1

The output I'm looking for is a record for each individual day, with a count of DISTINCT Active Members and a count of the DISTINCT Rolling 7 day Active Members.

I'm working with two tables -- [transactions] which has a record for every transaction through our system, including a dmn_date_id for the transaction processing date, and a DATE_DIMENSION which has every day between 1990 and 2020 with day_offset_nbr representing the days offset from TODAY() (this table is rebuilt nightly).

SELECT
    dd.date_tms
    , COUNT(DISTINCT(t.user_id)) as Active_Members_Today
    , COUNT(DISTINCT(t2.user_id)) as Rolling_7Day_Active_Members
FROM
    transactions t
    JOIN DATE_DIMENSION dd
        ON t.transaction_processed_date_id = dd.date_id
    JOIN DATE_DIMENSION d2
        ON d2.day_offset_nbr BETWEEN (dd.day_offset_nbr - 6) AND dd.day_offset_nbr
    JOIN transactions t2
        ON d2.dmn_date_id = t2.common_trans_processed_date_id
WHERE
    t.user_initiated_ind = 'Y' -- Only User-initiated Transactions
    AND t2.user_initiated_ind = 'Y'
    AND t.transaction_state_name = 'Accepted' -- Accepted Transactions Only
    AND t2.transaction_state_name = 'Accepted'  
    AND dd.day_offset_nbr >= -731
    AND d2.day_offset_nbr >= -738
GROUP BY
    dd.date_tms

The above query gets me what we need but it takes a long time (over an hour) to process. We've created indexes on user_initiated_ind and common_trans_processed_date_id including the user_id but the query is still spinning away.

Thinking I could resolve this with a window function it was learned you can't use DISTINCT in a window, which led me to using a derived table with Row_Number() to identify DISTINCT user's within the window, which led me to discover Row_Number() can't be used with a ROWS BETWEEN clause in a Window...

I've read Calculate running total / running balance which is similar but deals with a running balance rather than rolling distinct count, and I'm not quite sure how to apply what I've read to my problem; only that the INNER JOIN method I'm using is suggested as NOT the method to use...

...which inspired me to post.

How can I improve my query performance time?

Community
  • 1
  • 1
dijikul
  • 148
  • 4
  • 11

1 Answers1

1

The second join to transactions seems entirely unnecessary. You can do the counts using conditional aggregation. So, this version might be faster:

SELECT d2.date_tms,
       count(distinct case when d2.day_offset_nbr = dd.day_offset_nbr then t.user_id end) as Active_Members_Today,
       count(distinct t.user_id) as Rolling_7Day_Active_Members
FROM transactions t JOIN
     DATE_DIMENSION dd
     ON t.transaction_processed_date_id = dd.date_id JOIN
     DATE_DIMENSION d2
     ON d2.day_offset_nbr BETWEEN dd.day_offset_nbr AND (dd.day_offset_nbr + 6) AND 
WHERE t.user_initiated_ind = 'Y' AND -- Only User-initiated Transactions
      t.transaction_state_name = 'Accepted' AND -- Accepted Transactions Only
      dd.day_offset_nbr >= -731
GROUP BY d2.date_tms;
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • Using a CASE within a COUNT DISTINCT is an approach I hadn't considered! I like it, but it's not clear to me on how this will return correct results for a couple reasons: - *dt* isn't a defined table alias, so the _Rolling_7Day_Active_Members_ count won't work - *t* is joined to *dd* and we group by _dd.day_ which leaves me wondering how the previous days' counts get aggregated into the _Rolling_7Day_Active_Members_? – dijikul Dec 09 '15 at 18:09
  • @dijikul . . . The `dt` was a typo. As for the rest of your question, this produces seven rows for each row being aggregated . . . that gives the 7 day count. – Gordon Linoff Dec 09 '15 at 23:22
  • your answer produces inaccurate results. Grouping by dd.date_tms results in the second count having all the counts outside that day being removed. The resulting set ends up being a list of days with two columns containing the same counts. Joining transactions to the first date table is why. – dijikul Dec 10 '15 at 15:07
  • @dijikul . . . Duh. I see that. I think the solution is to aggregate by `d2.date_tms` instead. – Gordon Linoff Dec 11 '15 at 04:22