5

I have a table of transactional data, which are forecasts for the future. The same forecasts, identified by same date, type, location and product, are thus read in multiple times as the forecasts become more accurate as time goes by and are resent.

I would like to create a query that would group the transactions that are of the same type and for the same location, product and date and then select from these groups only the ones that have the newest updated timestamps.

The table has now hundreds of thousands of rows and as time goes by, millions, so a reasonably efficient solution would be appreciated :)

Example table:

date    |  location_code   | product_code  | quantity |   type   | updated_at 
------------+------------------+---------------+----------+----------+------------
2013-02-04 | ABC | 123 |  -26.421 | TRANSFER | 2013-01-12
2013-02-07 | ABC | 123 |    -48.1 | SALE | 2013-01-10
2013-02-06 | BCD | 234 |  -58.107 | SALE | 2013-01-11
2013-02-06 | BCD | 234 |      -60 | SALE | 2013-01-10
2013-02-04 | ABC | 123 |   -6.727 | TRANSFER | 2013-01-10

The desired result:

date    |  location_code   | product_code  | quantity |   type   | updated_at 
------------+------------------+---------------+----------+----------+------------
2013-02-04 | ABC | 123 |  -26.421 | TRANSFER | 2013-01-12
2013-02-07 | ABC | 123 |    -48.1 | SALE | 2013-01-10
2013-02-06 | BCD | 234 |  -58.107 | SALE | 2013-01-11

I tried for example:

SELECT t.date, t.location_code, t.product_code, t.quantity, t.type, t.updated_at
FROM transactions t
INNER JOIN
(
   SELECT MAX(updated_at) as max_updated_at
   FROM transactions
   GROUP BY product_code, location_code, type, date
) s on t.updated_at=max_updated_at; 

But this seems to take ages and doesn't seem to work.

Thank you for the help!

jesseniem
  • 177
  • 3
  • 10

3 Answers3

4
select distinct on ("date", location_code, product_code, type)
    "date",
    location_code,
    product_code,
    quantity,
    type,
    updated_at
from transactions t
order by t."date", t.location_code, t.product_code, t.type, t.updated_at desc
Clodoaldo Neto
  • 118,695
  • 26
  • 233
  • 260
  • Tried this one as well but got the following error: `ERROR: SELECT DISTINCT ON expressions must match initial ORDER BY expressions LINE 1: select distinct on (date, location_code, product_code, type)` – jesseniem Mar 16 '13 at 22:49
  • Thanks! Tested this one and it would seem to be the most efficient solution so far: `$ cat time3 Sun Mar 17 01:06:50 EET 2013 Sun Mar 17 01:06:53 EET 2013 | Sun Mar 17 01:06:54 EET 2013 Sun Mar 17 01:06:57 EET 2013 | Sun Mar 17 01:06:58 EET 2013 Sun Mar 17 01:07:02 EET 2013` – jesseniem Mar 16 '13 at 23:13
2

This is probably more efficient than the join with the derived table

select *
from (
    select date, 
           location_code, 
           product_code, 
           quantity, 
           type, 
           updated_at, 
           max(updated_at) over (partition by product_code, location_code, type, date) as max_updated
    from transactions
) t
where updated_at = max_updated;
  • Ran a fast very unscientific performance test, which showed negligible performance difference. Derived table approach is time2, this version time1: `$cat time1 Sun Mar 17 00:57:09 EET 2013 Sun Mar 17 00:57:13 EET 2013 | Sun Mar 17 00:57:15 EET 2013 Sun Mar 17 00:57:20 EET 2013 | Sun Mar 17 00:57:23 EET 2013 Sun Mar 17 00:57:29 EET 2013 $ cat time2 Sun Mar 17 00:55:45 EET 2013 Sun Mar 17 00:55:49 EET 2013 | Sun Mar 17 00:56:06 EET 2013 Sun Mar 17 00:56:11 EET 2013 | Sun Mar 17 00:56:14 EET 2013 Sun Mar 17 00:56:18 EET 2013 ` – jesseniem Mar 16 '13 at 23:04
1

Thanks Dan Bracuk!

This was the correct query:

SELECT t.date, t.location_code, t.product_code, t.quantity, t.type, t.updated_at
FROM transactions t
INNER JOIN
(
   SELECT MAX(updated_at) as max_updated_at, product_code prod, location_code loc, type     typ, date dat
   FROM transactions
   GROUP BY product_code, location_code, type, date
   ) s ON t.updated_at=max_updated_at AND t.location_code=loc AND t.product_code=prod AND  t.type=typ AND t.date=dat;
jesseniem
  • 177
  • 3
  • 10
  • 1
    Using a window function is probably more efficient. In this case , Clodoaldo Neto's `distinct on` solution is probably the most efficient one. –  Mar 16 '13 at 22:50