3

I have a situation which I think can be compared to services like CamelCamelCamel, Keepa and so on. Lets say I track the price of an article on each day for a couple of countries. So my table, lets call it Trend, would look something like this

Id     Created      ArticleId    Country    Price
-------------------------------------------------
01     19/11/05     452          US         45.90
02     19/11/05     452          CA         52.99
03     19/11/05     452          MX         99.99
04     19/11/06     452          US         20.00
05     19/11/06     452          CA         25.00
06     19/11/06     452          MX         50.00
...                
97     19/11/05     738          US         12.99
98     19/11/05     738          CA         17.50
99     19/11/05     738          MX         45.50

So it's the next day and I want to update the Trend table. If the price in a country is still the same, I skip the article/country combination. If there is a new price I'll add a new record.

Now I want to query the table to get each ArticleId / Country combination. But only the last record of it (orderd by timestamp). So taken the example above I'd expect to get the records 04, 05 and 06 for ArticleId 452. Not 01, 02 and 03

So I start out with this basic query. But how do I get to change it to get my expected results?

SELECT
    *
FROM
    Trend
ORDER BY 
    Created DESC
GMB
  • 216,147
  • 25
  • 84
  • 135
boop
  • 7,413
  • 13
  • 50
  • 94

2 Answers2

2

One method uses a correlated subquery for filtering:

select t.*
from trend t
where t.created = (
    select max(t1.created) 
    from trend t1
    where t1.articleId = t.articleId and t1.country = t.country
)

For performance, you want an index on (articleId, country, created).

You might also want to consider the anti-left join approach:

select t.*
from trend t
left join trend t1 
    on  t1.articleId = t.articleId 
    and t1.country = t.country
    and t1.created > t.created
where t1.articleId is null

Finally, another typical solution is to join the table with an aggregate query:

select t.*
from trend t
inner join (
    select articleId, country, max(created) created
    from trend
    group by articleId, country
) t1 
    on  t1.articleId = t.articleId 
    and t1.country = t.country
    and t1.created = t.created

Which solution performs better depends on the size and distribution of your data.

GMB
  • 216,147
  • 25
  • 84
  • 135
  • Can you give an explanation which solution performs best in which scenario? A single day is about 100k of records. So the table is going to grow quickly. – boop Feb 03 '20 at 20:33
  • @boop: it is hard to tell before hand, you shoud really test each solution against your data. If you have a large number of rows with a small number of groups (article/country tuples), the correlated subquery might offer good performance, but that's just a guess. – GMB Feb 03 '20 at 20:38
  • @boop . . . The correlated subquery with the right indexes often has the best performance or close to it. – Gordon Linoff Feb 03 '20 at 23:31
1

You can do this with a combination of DISTINCT and CROSS APPLY.

SELECT DISTINCT ca.Id, ca.Created, t.ArticleId, t.Country, ca.Price
FROM Trend t
CROSS APPLY (SELECT TOP 1 Id, Created, Price
             FROM Trend
             WHERE ArticleId = t.ArticleId AND Country = t.Country
             ORDER BY Created DESC) ca

Often times when writing queries using APPLY your joining columns (ArticleId and Country) make up a unique key on another table. If this applies to your database you can drop the DISTINCT and speed up the query.

SELECT ca.Id, ca.Created, a.ArticleId, a.Country, ca.Price
FROM Article a
CROSS APPLY (SELECT TOP 1 Id, Created, Price
             FROM Trend
             WHERE ArticleId = a.ArticleId AND Country = a.Country
             ORDER BY Created DESC) ca

Lastly, if you're having issues with performance, you may want to create an index.

CREATE NONCLUSTERED INDEX [NC_Trend_ArticleId] ON [Trend]
(
    [ArticleId] ASC,
    [Country] ASC,
    [Created] ASC
)
INCLUDE ([Price])

Presumably Id is a PRIMARY KEY and is already covered by a CLUSTERED INDEX, if so, the above should be appropriate for most solutions.

Derrick Moeller
  • 4,808
  • 2
  • 22
  • 48
  • Thank you. This seems to work just fine. Although this takes roughly about 5 times as long as a `SELECT * FROM Trend` with approx. 100k records. Can you think of another approach to get the same results? – boop Feb 03 '20 at 20:24
  • 1
    @boop It's always going to take longer than `SELECT *` but it shouldn't take excessively long given a dataset that small. Do you have a covering index? – Derrick Moeller Feb 03 '20 at 20:25