1

Having a de-normalized structure in Redshift and plan is to keep creating records and while retrieving only consider most recent attributes against users.

Following is the table:

user_id   state  created_at
1         A      15-10-2015 02:00:00 AM
2         A      15-10-2015 02:00:01 AM
3         A      15-10-2015 02:00:02 AM
1         B      15-10-2015 02:00:03 AM
4         A      15-10-2015 02:00:04 AM
5         B      15-10-2015 02:00:05 AM

And required result set is:

user_id   state  created_at
2         A      15-10-2015 02:00:01 AM
3         A      15-10-2015 02:00:02 AM
4         A      15-10-2015 02:00:04 AM

I have the query which retrieve the said result:

select user_id, first_value AS state
from (
   select user_id, first_value(state) OVER (
                     PARTITION BY user_id
                     ORDER BY created_at desc
                     ROWS between UNBOUNDED PRECEDING and CURRENT ROW)
   from customer_properties
   order by created_at) t
where first_value = 'A'

Is this the best way to retrieve or can the query be improved?

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
Gul
  • 1,757
  • 2
  • 15
  • 26
  • created_at column is getting used in the query and was missing in the sample data, question has been updated. – Gul Oct 15 '15 at 02:23

1 Answers1

0

The best query depends on various details: selectivity of the query predicate, cardinalities, data distribution. If state = 'A' is a selective condition (view rows qualify), this query should be substantially faster:

SELECT c.user_id, c.state
FROM   customer_properties c
LEFT   JOIN customer_properties c1 ON c1.user_id = c.user_id
                                  AND c1.created_at > c.created_at
WHERE  c.state = 'A'
AND    c1.user_id IS NULL;

Provided, there is an index on (state) (or even (state, user_id, created_at)) and another one on (user_id, created_at).

There are various ways to make sure a later version of the row does not exist:

If 'A' is a common value in state, this more generic query will be faster:

SELECT user_id, state
FROM (
   SELECT user_id, state
        , row_number() OVER (PARTITION BY user_id ORDER BY created_at DESC) AS rn
   FROM   customer_properties
   ) t
WHERE  t.rn = 1
AND    t.state = 'A';

I removed NULLS LAST, assuming that created_at is defined NOT NULL. Also, I don't think Redshift has it:

Both queries should work with the limited functionality of Redshift. With modern Postgres, there are better options:

Your original would return all rows per user_id, if the latest row matches. You would have to fold duplicates, needless work ...

Community
  • 1
  • 1
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228