
I have a simple table tableA in PostgreSQL 13 that contains a time series of event counts. In stylized form it looks something like this:

event_count     sys_timestamp

100             167877672772
110             167877672769
121             167877672987
111             167877673877
...             ...

Both fields are defined as `numeric`.
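For reference, a minimal definition of the table as described (I omit constraints and indexes here, since the stylized form above doesn't show any):

CREATE TABLE tableA (
    event_count   numeric,   -- event count at this point of the series
    sys_timestamp numeric    -- numeric timestamp, presumably in milliseconds
);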

With the help of answers from Stack Overflow I was able to create a query that counts the number of positive and negative excess events within a given time span, conditioned on the current event count. The query looks like this:

SELECT t1.*,
    (SELECT COUNT(*) FROM tableA t2
     WHERE t2.sys_timestamp >  t1.sys_timestamp
     AND   t2.sys_timestamp <= t1.sys_timestamp + 1000
     AND   t2.event_count   >= t1.event_count + 10) AS positive,
    (SELECT COUNT(*) FROM tableA t2
     WHERE t2.sys_timestamp >  t1.sys_timestamp
     AND   t2.sys_timestamp <= t1.sys_timestamp + 1000
     AND   t2.event_count   <= t1.event_count - 10) AS negative
FROM tableA AS t1;

The query works as expected and, in this particular example, returns for each row the count of positive and negative excesses (range ±10) within the defined time window (+1000 [milliseconds]).

However, I will have to run such queries on tables with several million (perhaps even 100+ million) entries, and even with about 500k rows the query takes a very long time to complete. Furthermore, while the time frame always remains the same within a given query (though the window size can change from query to query), in some instances I will have to use maybe 10 additional conditions similar to the positive/negative excesses in the same query.

Thus, I am looking for ways to improve the above query, primarily to achieve better performance given the size of the envisaged dataset, and secondarily with more conditions in mind.

My concrete questions:

  1. How can I reuse the common portion of the subquery to ensure that it is not executed twice (or several times), i.e. how can I reuse this part within the query?

     (SELECT COUNT(*) FROM tableA t2 
      WHERE t2.sys_timestamp >  t1.sys_timestamp
      AND   t2.sys_timestamp <= t1.sys_timestamp + 1000)
    
  2. Is there some performance advantage in turning the sys_timestamp field, which is currently numeric, into a timestamp field, and then using any of the PostgreSQL window functions? (Unfortunately I don't have enough experience with this at all; a conversion sketch of what I mean follows after this list.)

  3. Are there some clever ways to rewrite the query, aside from reusing the (partial) subquery, that materially increase the performance for large datasets?

  4. Is it perhaps even faster to run these types of queries outside of the database using something like Java, Scala, Python, etc.?
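For question 2, this is the kind of conversion I have in mind, assuming sys_timestamp holds Unix epoch milliseconds (I am not certain of the unit):

-- to_timestamp() takes seconds since the epoch, hence the division by 1000.0
ALTER TABLE tableA
    ALTER COLUMN sys_timestamp TYPE timestamptz
    USING to_timestamp(sys_timestamp / 1000.0);

After such a change, the window predicate would presumably become sys_timestamp <= t1.sys_timestamp + interval '1 second' instead of + 1000.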

Peter K.
    Sample data and desired results would really help. – Gordon Linoff Apr 07 '21 at 21:39
  • It's so much easier if you provide (the relevant parts of) table definitions (`CREATE TABLE` statements) showing data types and constraints, existing indexes and some sample data. [Consider instructions for performance questions here.](https://stackoverflow.com/tags/postgresql-performance/info) – Erwin Brandstetter Apr 08 '21 at 00:09

2 Answers


How can I reuse the common portion of the subquery ...?

Use conditional aggregates in a single LATERAL subquery:

SELECT t1.*, t2.positive, t2.negative
FROM   tableA t1
CROSS  JOIN LATERAL (
   SELECT COUNT(*) FILTER (WHERE t2.event_count >= t1.event_count + 10) AS positive
        , COUNT(*) FILTER (WHERE t2.event_count <= t1.event_count - 10) AS negative
   FROM   tableA t2 
   WHERE  t2.sys_timestamp >  t1.sys_timestamp
   AND    t2.sys_timestamp <= t1.sys_timestamp + 1000
   ) t2;

It can be a CROSS JOIN because the aggregate subquery always returns exactly one row.

Use conditional aggregates with the FILTER clause to base multiple aggregates on the same time frame.
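The same pattern extends to the ~10 additional conditions mentioned in the question; each is just one more aggregate over the same time frame (the extra thresholds here are made up for illustration):

SELECT t1.*, t2.*
FROM   tableA t1
CROSS  JOIN LATERAL (
   SELECT count(*) FILTER (WHERE t2.event_count >= t1.event_count + 10) AS positive_10
        , count(*) FILTER (WHERE t2.event_count >= t1.event_count + 20) AS positive_20  -- hypothetical extra condition
        , count(*) FILTER (WHERE t2.event_count <= t1.event_count - 10) AS negative_10
        , count(*) FILTER (WHERE t2.event_count <= t1.event_count - 20) AS negative_20  -- hypothetical extra condition
   FROM   tableA t2
   WHERE  t2.sys_timestamp >  t1.sys_timestamp
   AND    t2.sys_timestamp <= t1.sys_timestamp + 1000
   ) t2;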

event_count should probably be integer or bigint.

sys_timestamp should probably be timestamp or timestamptz.

An index on (sys_timestamp) is the minimum requirement for this. A multicolumn index on (sys_timestamp, event_count) typically helps some more. If the table is vacuumed enough, you get index-only scans from it.
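A sketch of those indexes (the index names are mine):

CREATE INDEX tablea_sys_timestamp_idx ON tableA (sys_timestamp);

-- or the covering variant, enabling index-only scans:
CREATE INDEX tablea_sys_timestamp_event_count_idx ON tableA (sys_timestamp, event_count);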

Depending on the exact data distribution (most importantly, how much the time frames overlap) and other characteristics of the database, a tailored procedural solution may be faster still. It can be done in any client-side language, but a server-side PL/pgSQL solution is superior because it saves all the round trips to the DB server, type conversions, etc.
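Short of a full procedural rewrite, here is a minimal sketch of moving the query server-side as a parameterized function (the function name and signature are my own), so window size and threshold can vary from call to call without shipping data to the client:

CREATE OR REPLACE FUNCTION event_excesses(_window numeric, _threshold numeric)
  RETURNS TABLE (event_count numeric, sys_timestamp numeric,
                 positive bigint, negative bigint)
  LANGUAGE sql STABLE AS
$$
SELECT t1.event_count, t1.sys_timestamp, t2.positive, t2.negative
FROM   tableA t1
CROSS  JOIN LATERAL (
   SELECT count(*) FILTER (WHERE t2.event_count >= t1.event_count + _threshold) AS positive
        , count(*) FILTER (WHERE t2.event_count <= t1.event_count - _threshold) AS negative
   FROM   tableA t2
   WHERE  t2.sys_timestamp >  t1.sys_timestamp
   AND    t2.sys_timestamp <= t1.sys_timestamp + _window
   ) t2;
$$;

-- usage, matching the original query:
SELECT * FROM event_excesses(1000, 10);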

Erwin Brandstetter

You have the right idea. The way to write subqueries you can reuse within a query is a WITH clause (AKA subquery factoring, or a common table expression). The WITH query runs once as a subquery of the main query and can be reused by subsequent subqueries or the final query.

The first step creates parent-child detail rows: the table joined to itself and filtered down by the timestamp window.

Then the next step is to reuse that same detail query for everything else.

Assuming that event_count is a primary key, or that you have a compound index on event_count and sys_timestamp, this would look like:

with baseQuery as
(
   -- detail rows: each row (t1) paired with every later row (t2) in its time window
   SELECT distinct t1.event_count as startEventCount, t1.event_count+10 as pEndEventCount
   ,t1.event_count-10 as nEndEventCount, t2.event_count as t2EventCount
   FROM tableA t1, tableA t2
   where t2.sys_timestamp >  t1.sys_timestamp
   and   t2.sys_timestamp <= t1.sys_timestamp + 1000
), posSummary as
(
   select bq.startEventCount, count(*) as positive
   from baseQuery bq
   where bq.t2EventCount >= bq.pEndEventCount
   group by bq.startEventCount
), negSummary as
(
   select bq.startEventCount, count(*) as negative
   from baseQuery bq
   where bq.t2EventCount <= bq.nEndEventCount
   group by bq.startEventCount
)
-- left joins plus coalesce keep rows whose excess counts are zero
select t1.*, coalesce(ps.positive, 0) as positive, coalesce(ns.negative, 0) as negative
from tableA t1
left join posSummary ps on t1.event_count=ps.startEventCount
left join negSummary ns on t1.event_count=ns.startEventCount;
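One PostgreSQL detail worth knowing here: since version 12, a WITH query that is referenced more than once, as baseQuery is by posSummary and negSummary, is materialized (computed once) by default; you can write WITH baseQuery AS MATERIALIZED (...) to make that run-once behavior explicit.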

Notes:

  1. The distinct for baseQuery may not be necessary based on your actual keys.
  2. The final join is done against tableA, but it could also use a summary of baseQuery as a separate WITH query, since that has already run once. That seemed unnecessary here.

You can play around to see what works.

There are other ways, of course, but this best illustrates how and where things could be improved.

WITH queries are used in multi-dimensional data warehouse queries because, when you have so much data to join across so many tables (dimensions and facts), a strategy of isolating the queries helps you understand where indexes are needed and perhaps how to minimize the rows the query needs to deal with further down the line to completion. For example, it should be obvious that if you can minimize the rows returned in baseQuery or make it run faster (check the explain plans), your query improves overall.
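To check those plans, prefix the statement with EXPLAIN; a simplified example of the pattern (substitute the full query from above for the trivial one shown):

EXPLAIN (ANALYZE, BUFFERS)
SELECT t1.event_count, count(*)
FROM   tableA t1
JOIN   tableA t2 ON t2.sys_timestamp >  t1.sys_timestamp
                AND t2.sys_timestamp <= t1.sys_timestamp + 1000
GROUP  BY t1.event_count;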

Chuma