Redshift PostgreSQL Distinct ON Operator

Question

I have a data set that I want to parse for to see multi-touch attribution. The data set is made up by leads who responded to a marketing campaign and their marketing source.

Each lead can respond to multiple campaigns and I want to get their first marketing source and their last marketing source in the same table.

I was thinking I could create two tables and use a select statement from both. The first table would attempt to create a table with the most recent marketing source from every person (using email as their unique ID).

create table temp.multitouch1 as (
select distinct on (email) email, date, market_source as last_source 
from sf.campaignmember
where date >= '1/1/2016' ORDER BY DATE DESC);

Then I would create a table with deduped emails but this time for the first source.

create table temp.multitouch2 as (
select distinct on (email) email, date, market_source as first_source 
from sf.campaignmember
where date >= '1/1/2016' ORDER BY DATE ASC);

Finally I wanted to simply select the email and join the first and last market sources to it each in their own column.

select a.email, a.last_source, b.first_source, a.date 
from temp.multitouch1 a
left join temp.multitouch b on b.email = a.email

Since distinct on doesn't work on redshift's postgresql version I was hoping someone had an idea to solve this issue in another way.

EDIT 2/22: For more context I'm dealing with people and campaigns they've responded to. Each record is a "campaign response" and every person can have more than one campaign response with multiple sources. I'm trying make a select statement which would dedupe by person and then have columns for the first campaign/marketing source they've responded to and the last campaign/marketing source they've responded to respectively.

EDIT 2/24: Ideal output is a table with 4 columns: email, last_source, first_source, date.

The first and last source columns would be the same for people with only 1 campaign member record and different for everyone who has more than 1 campaign member record.

According to this page in AWS docs I am: http://docs.aws.amazon.com/redshift/latest/dg/c_redshift-and-postgres-sql.html — Berra2k, Feb 22 '16 at 18:30

Paul Maxwell · Accepted Answer · 2016-02-25T01:12:51.503

I believe you could use row_number() inside case expressions like this:

SELECT
      email
    , MIN(first_source) AS first_source
    , MIN(date) first_date
    , MAX(last_source) AS last_source
    , MAX(date) AS last_date
FROM (
      SELECT
            email
          , date
          , CASE
                  WHEN ROW_NUMBER() OVER (PARTITION BY email ORDER BY date ASC) = 1 THEN market_source
                  ELSE NULL
            END AS first_source
          , CASE
                  WHEN ROW_NUMBER() OVER (PARTITION BY email ORDER BY date DESC) = 1 THEN market_source
                  ELSE NULL
            END AS last_source
      FROM sf.campaignmember
      WHERE date >= '2016-01-01'
      ) s
WHERE first_source IS NOT NULL
      OR last_source IS NOT NULL
GROUP BY
      email

tested here: SQL Fiddle

PostgreSQL 9.3 Schema Setup:

CREATE TABLE campaignmember
    (email varchar(3), date timestamp, market_source varchar(1))
;

INSERT INTO campaignmember
    (email, date, market_source)
VALUES
    ('a@a', '2016-01-02 00:00:00', 'x'),
    ('a@a', '2016-01-03 00:00:00', 'y'),
    ('a@a', '2016-01-04 00:00:00', 'z'),
    ('b@b', '2016-01-02 00:00:00', 'x')
;

Query 1:

SELECT
      email
    , MIN(first_source) AS first_source
    , MIN(date) first_date
    , MAX(last_source) AS last_source
    , MAX(date) AS last_date
FROM (
      SELECT
            email
          , date
          , CASE
                  WHEN ROW_NUMBER() OVER (PARTITION BY email ORDER BY date ASC) = 1 THEN market_source
                  ELSE NULL
            END AS first_source
          , CASE
                  WHEN ROW_NUMBER() OVER (PARTITION BY email ORDER BY date DESC) = 1 THEN market_source
                  ELSE NULL
            END AS last_source
      FROM campaignmember
      WHERE date >= '2016-01-01'
      ) s
WHERE first_source IS NOT NULL
      OR last_source IS NOT NULL
GROUP BY
      email

Results:

| email | first_source |                first_date | last_source |                 last_date |
|-------|--------------|---------------------------|-------------|---------------------------|
|   a@a |            x | January, 02 2016 00:00:00 |           z | January, 04 2016 00:00:00 |
|   b@b |            x | January, 02 2016 00:00:00 |           x | January, 02 2016 00:00:00 |

& a small extension to the request, count the number of contact points.

SELECT
      email
    , MIN(first_source) AS first_source
    , MIN(date) first_date
    , MAX(last_source) AS last_source
    , MAX(date) AS last_date
    , MAX(numof) AS Numberof_Contacts 
FROM (
      SELECT
            email
          , date
          , CASE
                  WHEN ROW_NUMBER() OVER (PARTITION BY email ORDER BY date ASC) = 1 THEN market_source
                  ELSE NULL
            END AS first_source
          , CASE
                  WHEN ROW_NUMBER() OVER (PARTITION BY email ORDER BY date DESC) = 1 THEN market_source
                  ELSE NULL
            END AS last_source
          , COUNT(*) OVER (PARTITION BY email) as numof
      FROM campaignmember
      WHERE date >= '2016-01-01'
      ) s
WHERE first_source IS NOT NULL
      OR last_source IS NOT NULL
GROUP BY
      email

score -1 · Answer 2 · answered Feb 22 '16 at 18:31

-1

You can use the good old left join groupwise maximum.

SELECT DISTINCT c1.email, c1.date, c1.market_source
FROM sf.campaignmember c1
  LEFT JOIN sf.campaignmember c2 
    ON c1.email = c2.email AND c1.date > c2.date AND c1.id > c2.id
  LEFT JOIN sf.campaignmember c3
    ON c1.email = c3.email AND c1.date < c3.date AND c1.id > c3.id
WHERE c1.date >= '1/1/2016' AND c2.date >= '1/1/2016'
      AND (c2.email IS NULL OR c3.email IS NULL)

This assumes you have an unique id column, if (date, email) is unique id is not needed.

answered Feb 22 '16 at 18:31

Jakub Kania

15,665
2
37
47

Wouldn't I want to select c2.market_source and then c3.market_source instead of just using c1.market_source? Also the trouble is that some people have multiple market_source records because they've responded to multiple campaigns, others do not. – Berra2k Feb 22 '16 at 18:55
@Berra2k You want to select the records from c1 that don't have older records then them (c2) or don't have younger records (c3). If there will be only one record for the email then it still will be returned. Please try and see. – Jakub Kania Feb 22 '16 at 19:08

Redshift PostgreSQL Distinct ON Operator

2 Answers2