
I am working with historical price data of many different products.

I have a table products that lists all products with their name, description, etc., identifying them with a UUID. Then there is another table history that stores every price change ever made for every product. The prices may (and usually do) change quite a few times per day.

Now I want to calculate what each product's price was at a specific point in time, say 14th March 2015 at 12 noon. How can I do this in SQL?

I am able to do this for one product:

SELECT product_id, price, date 
FROM history 
WHERE product_id = 'aa6d9976-e9ae-4478-486e-097e86c1e5fe' 
AND date <= '2015-03-14 12:00:00+02' 
ORDER BY date DESC LIMIT 1

  ->   aa6d9976-e9ae-4478-486e-097e86c1e5fe    109     2015-03-14 11:55:00+01

But I would like ALL products in one query in one set. My idea was to get all the products and LEFT JOIN that table with the history, selecting the appropriate price for each, but I fail at the latter:

SELECT products.product_id, name, price, date 
FROM products 
  LEFT JOIN history ON products.product_id = history.product_id 
WHERE date "is the greatest value that is still somewhat smaller than" '2015-03-14 12:00:00+01'

How do you write properly what I tried to express in the quotes?

I use PostgreSQL (although I have mainly worked with MySQL before). The tables are about 15 thousand (products) and 50 million (history) rows respectively.


Here is some sample data, if you'd like some:

PRODUCTS

product_id                              name

aa6d9976-e9ae-4478-486e-097e86c1e5fe    One
8da97d50-540e-4fdb-d032-7f443a9869a0    Two
b51654ea-6190-4ed2-5e23-7075ffd3b472    Three


HISTORY

id  product_id                              price   date

1   aa6d9976-e9ae-4478-486e-097e86c1e5fe    100     2015-03-14 09:30:00+01
2   aa6d9976-e9ae-4478-486e-097e86c1e5fe    110     2015-03-14 10:48:00+01
3   b51654ea-6190-4ed2-5e23-7075ffd3b472    9       2015-03-14 11:01:00+01
4   8da97d50-540e-4fdb-d032-7f443a9869a0    49      2015-03-14 11:27:00+01
5   aa6d9976-e9ae-4478-486e-097e86c1e5fe    109     2015-03-14 11:55:00+01
6   b51654ea-6190-4ed2-5e23-7075ffd3b472    8       2015-03-14 13:59:00+01
7   aa6d9976-e9ae-4478-486e-097e86c1e5fe    110     2015-03-14 16:10:00+01
8   8da97d50-540e-4fdb-d032-7f443a9869a0    48      2015-03-14 19:34:00+01
9   8da97d50-540e-4fdb-d032-7f443a9869a0    49      2015-03-14 23:30:00+01
10  aa6d9976-e9ae-4478-486e-097e86c1e5fe    103     2015-03-14 23:33:00+01


DESIRED OUTPUT

product_id                              name    price   date

aa6d9976-e9ae-4478-486e-097e86c1e5fe    One     109     2015-03-14 11:55:00+01
8da97d50-540e-4fdb-d032-7f443a9869a0    Two     49      2015-03-14 11:27:00+01
b51654ea-6190-4ed2-5e23-7075ffd3b472    Three   9       2015-03-14 11:01:00+01
Christallkeks
  • This model is probably much easier (and more efficient) if you store start **and** end date for each price (e.g. as a `tsrange`). You can ensure non-overlapping price intervals using an exclusion constraint. This might make inserts a bit more complicated/slower, but if read performance is more important, you should take that into consideration. The query would be as simple as `select * from history where product_id = '...' and valid_during @> timestamp '2015-03-14 12:00:00+02'` –  Jan 06 '16 at 23:03
  • It's not my structure, I have to deal with it as it is... :/ Any ideas? – Christallkeks Jan 07 '16 at 00:07
  • @a_horse_with_no_name - If I were doing that, I'd probably make the `end` column calculated (maybe MQT or view) or triggered, and just insert as normal. Much harder to mess up later. – Clockwork-Muse Jan 07 '16 at 00:19
  • @Clockwork-Muse: true, you can update the end column in an insert trigger if a new price is added. But with an exclusion constraint you can at least be sure you will never get overlapping intervals. –  Jan 07 '16 at 00:20
  • @a_horse_with_no_name - yeah, which is why I'd probably MQT it - guaranteed no overlaps then, and prices change "infrequently". In any case, trivial variant of [tag:greatest-n-per-group] problems - existing [answers](http://stackoverflow.com/questions/3800551/select-first-row-in-each-group-by-group) should work just fine. – Clockwork-Muse Jan 07 '16 at 00:32

2 Answers


First you write a query to find, for each product, the latest date at or before the date you are querying. That would look something like this:

    select product_id, MAX(date) date 
    from history
    where date <= '2015-03-14 12:00:00'
    group by product_id

Then you can join that sub-query with your products and history tables to get the results you want. Note that the second join must also be a LEFT JOIN, otherwise products with no price recorded before the cutoff would be filtered out again:

select products.*, history.price, history.date
from products
left join
    (
    select product_id, MAX(date) date 
    from history
    where date <= '2015-03-14 12:00:00'
    group by product_id
    ) PriceDates
on products.product_id = PriceDates.product_id
left join history
on PriceDates.product_id = history.product_id
    and PriceDates.date = history.date
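
As the comments on the question note, this is a greatest-n-per-group problem; in PostgreSQL the same result can also be sketched with `DISTINCT ON` (assuming the tables from the question):

```sql
-- For each product, keep only the latest history row at or before the cutoff.
-- The LEFT JOIN keeps products that have no price yet (price/date come back NULL).
SELECT DISTINCT ON (p.product_id)
       p.product_id, p.name, h.price, h.date
FROM products p
LEFT JOIN history h
       ON h.product_id = p.product_id
      AND h.date <= '2015-03-14 12:00:00+01'
ORDER BY p.product_id, h.date DESC;
```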
Brian Pressler

Use the window function lead() to find, for each row, the date of the next price change for the same product_id; together the two dates bound the interval during which that price was valid. (BTW: I renamed date to zdate. date is a bad name for a column, since it is the name of a datatype.)

SELECT h0.* 
FROM history h0
JOIN (
    SELECT  id
    , zdate AS start_date
    , lead(zdate, 1, 'infinity' ) OVER (PARTITION BY product_id
                                        ORDER BY zdate) AS end_date
    FROM history
    ) h1 ON h0.id = h1.id
    AND h1.start_date <= '2015-03-14 12:00:00+01'
    AND h1.end_date > '2015-03-14 12:00:00+01'
    ;

An index on {product_id , zdate} will probably help ;-)
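
Such an index could be created like this (the index name is illustrative):

```sql
CREATE INDEX history_product_zdate_idx ON history (product_id, zdate);
```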


Result:

 id |              product_id              | price |        zdate        
----+--------------------------------------+-------+---------------------
  4 | 8da97d50-540e-4fdb-d032-7f443a9869a0 |    49 | 2015-03-14 11:27:00
  5 | aa6d9976-e9ae-4478-486e-097e86c1e5fe |   109 | 2015-03-14 11:55:00
  3 | b51654ea-6190-4ed2-5e23-7075ffd3b472 |     9 | 2015-03-14 11:01:00
(3 rows)
wildplasser