5

I have a table tab1 in timescale db which has 3 columns tag, time, value. time and tag make up the pk for the table: (time, tag).

There are more than 500 lakhs (50 000 000) of rows. I need to find the latest timestamp or the max(time) for each of the N tags.

There are few things that I tried, I'll share my experience with each:

  1. inner query
SELECT "time", "tag", "value"
FROM tab1 
WHERE ("tag","time") IN 
(SELECT "tag", MAX("time") FROM tab1 WHERE "tag" IN(tag1,tag2) GROUP BY "tag" );

this is giving result but taking around 19 sec to execute which is beyond the acceptable limit

  1. using last function from timescale db
SELECT tag, last(time, time), last(value,time)
FROM tab1
WHERE "tag" IN (tag1,tag2) GROUP BY "tag" ;

this is giving output within 10 sec.

I need to find another feasible solution similar to 2nd one that is perhaps performing better. I tried few other things like LATERAL JOIN (3), WINDOW FUNCTIONS (ROW_NUMBER, PARTITION) (4) but the solution is not as expected.

  1. using lateral gives cross of multiple columns and not one single value with max time as expected. Moreover it is taking 15sec to execute but it might be due to wrong query.
SELECT table1."tag", table1."time",table1."value" from tab1 as table1
join lateral (
    SELECT table2 ."tag",table2 ."time" from tab1 as table2   
    where table2."tag" = table1."tag"
    order by table2."time" desc limit 1
) p on true
where table1."tag" in (tag1,tag2)
  1. when trying partition I want to put limit 1 like this: (partition by tag order by time desc limit 1) but it is giving syntax error. without limit 1 I am not getting the latest time.
SELECT * from 
( SELECT *, row_number() over (partition by tag order by time desc) as rownum
from  tab1) a
where tag in (tag1,tag2)

Can anyone suggest what's wrong with the queries in 3,4 or suggest if there is another alternative.

Index for my table is: enter image description here

CuriousCurie
  • 113
  • 10
  • do you have an index on time desc, tag? If not, create it and test first query again – Horaciux Jan 12 '23 at 16:53
  • Indexing is present for both tags and time @Horaciux – CuriousCurie Jan 12 '23 at 21:36
  • could you show it, just to be certain? I need to see if it is one index in the right order, or not. Two separated indexes do not cut it, wrong order either. – Horaciux Jan 12 '23 at 21:48
  • @Horaciux, I have edited the question to address the index of the table. Also, pls note that I have tried with the sort order of index as desc but didnt notice any significant improvement. – CuriousCurie Jan 13 '23 at 16:12
  • 1
    The index needs to be on tag, time desc, it can't be in the other order. Going to respond below. And to be clear this needs to be a *single index* on *multiple columns* you can create the index by doing CREATE INDEX on tab1(tag, time desc); – davidk Jan 13 '23 at 20:06

3 Answers3

5

There are a couple things that are going to help this and make this query easier and more performant. The first, and probably most important one is the index on the table / hypertable - it will need to be a multi-column index on tag, time desc - the order of the time doesn't matter all that much, but the order of columns in the index matters a lot. tag has to be the first column here because we need to first search by tag, then get the latest time, if we have either separate indexes or if we're ordered by time first, this will be highly inefficient.

You can create this index with a call like this:

CREATE INDEX ON tab1 (tag, "time" DESC);

The next thing is the formulation of the query. The simplest way to get this for each tag is to write a DISTINCT ON query. In Timescale we've optimized this sort of query. It's a bit of a weird way to formulate it so can be a bit hard to find though.

Basically you'd write it something like:

SELECT DISTINCT ON (tag) tag, "time" FROM tab1 ORDER BY tag, "time" DESC;

And that should give you what you want. It's a bit weird, but it will work!

I'm not going to go through the other approaches, but most of them will be improved dramatically with the index, but this is still likely going to be the most performant approach.

If you want to, please post a comment with how this worked and if it sped things up for you!

davidk
  • 1,003
  • 5
  • 9
  • Thanks a lot for the answer. I'll try this out and share the updates with you. Coming to the indexing thing, we had tried with changing it from asc to desc but didn't see lot of improvement. But as you suggested will try out by prioritising tag column. – CuriousCurie Jan 15 '23 at 17:16
  • 1
    This worked brilliantly. Got the response within 2 sec even with more than 15 tags. We have updated the index as well. For others who need to understand how this query works, please go through the link in the answer then to [this answer](https://stackoverflow.com/questions/50846722/what-is-the-difference-between-postgres-distinct-vs-distinct-on) that describes the difference between distinct and distinct on. – CuriousCurie Jan 16 '23 at 10:10
2

I want to add something to David's excellent answer here, and that is an understanding of why the column order is important for your query.

Your primary key is effectively a multi-column b-tree index. This means that to use the index, the query has to traverse the time column first and then can check the tag column. In your case, this isn't very helpful to you. What you want to be able to to instead is traverse the tag first, and then grab the most recent time.

To do this, you have to have the tag first on your btree list. I don't know if asc or desc is going to make a big difference here since PostgreSQL can search an index in either direction and your direction on scan on time is not dependent on your direction of scanning tags. However the Timescale optimizations for skip scans might, so best to follow that advice.

  • I think they had two separate indexes one on tags and one on time before, I think they have the multi-column index now with the correct ordering, but not 100% sure! – davidk Jan 16 '23 at 16:01
  • But this is an important point about ordering in Btrees! – davidk Jan 16 '23 at 16:27
  • @davidk yes there are cases where it makes a big difference. However, since PostgreSQL can scan btrees in either direction, the cases where you have a big impact may be fewer. For example, your point regarding skip scans means that if you have to scan two different keys in different directions you will have a problem if these are not specified in the index ordering. – Chris Travers Jan 17 '23 at 01:19
  • On the other hand, if both tag and time were descending, in the query, then the index requirement is that they both be the same direction (since you can scan forward with both ascending, or backwards with both descending if they are both, say, ascending). I would actually recommend that if you specify the direction of one column on the index, it is clearer to specify both so you are explicit in their relationship.h – Chris Travers Jan 17 '23 at 03:23
  • On the question of the original index, the primary key specified both columns, indicating a multi-column b-tree. – Chris Travers Jan 17 '23 at 03:25
  • Just to clarify, index and pk are both multi column (tag,time) – CuriousCurie Jan 17 '23 at 06:05
0

So figured out the other 2 approaches:

  1. Using lateral

SELECT distinct t_outer.tag, t_top.time, t_top.value from tab1 t_outer join lateral ( SELECT * from tab1 t_inner where t_inner.tag = t_outer.tag order by t_inner.time desc limit 1 ) t_top on true where t_outer.tag in (tag1)

This works but takes more than 14 sec to process.

  1. Using Window functions

SELECT * FROM (SELECT tag,time,"value", rank() OVER (PARTITION BY tag order by time desc ) as RN FROM tab1 WHERE tag IN(tag1) ) as results WHERE results.RN=1;

This also works and takes around 9 sec to process.

When comparing the results inner query and lateral are worst performing queries even for a single tag. So they are eliminated.

Now Last() and Partition() query are our top contenders, till now. Last() is performing better if fewer columns are fetched, otherwise if we fetch all the columns, the execution time of both of them is comparable.

One more point that I would like to add is that using ORDER BY LIMIT query is performing way way better than all the options (execution time is less than 1 sec for single tag) however a drawback is that it wont work for multiple tag input. So in case you need to query for single tag and have similar db configs, this is something you can try.

CuriousCurie
  • 113
  • 10