2

I'm using PostgreSQL in SQLWorkbenchJ and I'm struggling.

I have a WITH statement that selects a date depending on its row number. If the statement cannot find the row number, I want to select null in the date field. This currently doesn't happen, it just selects records where all fields are not null. I'm assuming it is to do with joins but I'm not sure.

The current statement is below. It should return around 50,000 records but it currently returning just under 2000.

WITH FifthEnquiry AS
(
  SELECT emailaddress,
         SentDate,
         ROW_NUMBER() OVER (PARTITION BY emailaddress ORDER BY COUNT(*) DESC) AS rk
  FROM SentEmails
  GROUP BY emailaddress,
           SentDate
),
TenthEnquiry AS
(
  SELECT emailaddress,
         SentDate,
         ROW_NUMBER() OVER (PARTITION BY emailaddress ORDER BY COUNT(*) DESC) AS rk
  FROM SentEmails
  GROUP BY emailaddress,
           SentDate
),
TwentiethEnquiry AS
(
  SELECT emailaddress,
         SentDate,
         ROW_NUMBER() OVER (PARTITION BY emailaddress ORDER BY COUNT(*) DESC) AS rk
  FROM SentEmails
  GROUP BY emailaddress,
           SentDate
)
SELECT FifthEnquiry.emailaddress,
       FifthEnquiry.SentDate AS Fith,
       TenthEnquiry.SentDate AS Tenth,
       TwentiethEnquiry.SentDate AS Twentieth,
FROM FifthEnquiry
  JOIN TenthEnquiry ON FifthEnquiry.emailaddress = TenthEnquiry.emailaddress
  JOIN TwentiethEnquiry ON FifthEnquiry.emailaddress = TwentiethEnquiry.emailaddress
WHERE (FifthEnquiry.rk = 5)
AND   (TenthEnquiry.rk = 10)
AND   (TwentiethEnquiry.rk = 20)
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
Chris Hall
  • 83
  • 7

1 Answers1

3

You can largely simplify. And use LEFT JOIN to preserve all email addresses with at least 5 rows after GROUP BY, even if there is no 10th or 20th row:

WITH cte AS (
   SELECT emailaddress, SentDate,
          ROW_NUMBER() OVER (PARTITION BY emailaddress
                             ORDER BY COUNT(*) DESC, SentDate) AS rn
   FROM   SentEmails
   GROUP  BY 1,2
   )
SELECT enq05.emailaddress,
       enq05.SentDate AS fifth,
       enq10.SentDate AS tenth,
       enq20.SentDate AS twentieth
FROM        cte AS enq05
LEFT   JOIN cte AS enq10 ON enq10.emailaddress = enq05.emailaddress
                        AND enq10.rn = 10
LEFT   JOIN cte AS enq20 ON enq20.emailaddress = enq05.emailaddress
                        AND enq20.rn = 20
WHERE  enq05.rn = 5;
  • You don't need separate CTEs, all three are doing the same. One CTE is enough, and obviously faster. Use self-joins with different table aliases in the outer query instead.

  • Since we are using LEFT JOIN now, it matters whether we put additional conditions in the JOIN or the WHERE clause. Conditions in the WHERE clause effectively force Postgres to treat the join like a plain [INNER] JOIN. I moved conditions to the JOIN clause accordingly. Details:

  • Using rn, not rk as column alias. It's a "row number", not a "rank". Be aware of the important difference in behavior between row_number() and rank().

  • Add SentDate to ORDER BY as tiebreaker for (emailaddress, SentDate) with the same count to get a stable sort order. The way I have it SentDate IS NULL comes last per group. You may want to use NULLS LAST for descending sort order (not for COUNT(*), which is never NULL):

  • One more subtle detail you need to be aware of: tenth and twentieth can all be NULL in the result for two different reasons, if SentDate can be NULL in the underlying table. A NULL value for tenth in the result can mean there are less than 10 distinct values for the emailaddress or it can mean that NULL is at the 10th position according to your sort order.

Community
  • 1
  • 1
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • I still get the same issue with this query. Glad it can be shortened though. I still dont see "null" in any of the columns. It only selects email addresses that have minimum of 20 rows if that makes sense – Chris Hall Jan 19 '15 at 11:16
  • @ChrisHall: Did you test my latest version? My first draft still had the WHERE conditions force-converting the LEFT JOIN to a plain JOIN. The current version definitely includes *all* email-addresses with at least 5 rows (after grouping!!). You did want to number rows ***after*** grouping, right? Because that's what you are doing now. – Erwin Brandstetter Jan 19 '15 at 11:21
  • @ChrisHall: I added some more pointers. There are tricky aspects to this query. – Erwin Brandstetter Jan 19 '15 at 11:49