0

I'm using AWS Athena, I have a query as such:

SELECT * FROM foo ORDER BY purchase_date ASC

But I want to de-dupe the records.

Since this is Athena, and it needs to process records in parallel, I'm not sure how to write the DISTINCT clause.

How can I make this query so that it doesn't have any duplicate records in the result set?

Thanks

shuvalov
  • 4,713
  • 2
  • 20
  • 17
  • just use `select distinct c1,c2...from tbl` – Vamsi Prabhala Mar 13 '20 at 19:49
  • @VamsiPrabhala that doesn't work because the nature of AWS Athena is such that it's streaming records and becuase I have an ORDER BY clause, it somehow needs all the records... anyway let me try what you're saying thank you very much – Dustin Anderson Mar 13 '20 at 19:52
  • that implementation detail you shouldn't worry about – Vamsi Prabhala Mar 13 '20 at 19:55
  • @VamsiPrabhala so I did SELECT DISTINCT foo, bar from tbl but I got only the columns foo and bar in the result set. How to get all columns? – Dustin Anderson Mar 13 '20 at 20:57
  • 1
    This seems to work: SELECT * FROM ( SELECT *, row_number() over (partition by foo) as row_number FROM tbl ) AS ROWS WHERE row_number = 1 order by bar which I got from: https://stackoverflow.com/questions/12693089/pgerror-select-distinct-order-by-expressions-must-appear-in-select-list – Dustin Anderson Mar 13 '20 at 21:53

0 Answers0