Optimising function which extracts records with a minimum gap in timestamps

Question

I have a big table of timestamps in Postgres 9.4.5:

CREATE TABLE vessel_position (
  posid serial NOT NULL,
  mmsi integer NOT NULL,
  "timestamp" timestamp with time zone,
  the_geom geometry(PointZ,4326),
  CONSTRAINT "PK_posid_mmsi" PRIMARY KEY (posid, mmsi)
);

Additional index:

CREATE INDEX vessel_position_timestamp_idx ON vessel_position ("timestamp");

I want to extract every row where the timestamp is at least x minutes after the previous row. I've tried a few different SELECT statements using LAG() which all kind of worked, but didn't give me the exact result I require. The below functions gives me what I need, but I feel it could be quicker:

CREATE OR REPLACE FUNCTION _getVesslTrackWithInterval(mmsi integer, startTime character varying (25) ,endTime character varying (25), interval_min integer)
RETURNS SETOF vessel_position AS
$func$
DECLARE
    count integer DEFAULT 0;
    posids varchar DEFAULT '';
    tbl CURSOR FOR
    SELECT
      posID
      ,EXTRACT(EPOCH FROM (timestamp -  lag(timestamp) OVER (ORDER BY posid asc)))::int as diff
    FROM vessel_position vp WHERE vp.mmsi = $1  AND vp.timestamp BETWEEN $2::timestamp AND $3::timestamp;
BEGIN
FOR row IN tbl
LOOP
    count := coalesce(row.diff,0) + count;
    IF count >= $4*60 OR count = 0 THEN
            posids:= posids || row.posid || ',';
            count:= 0;
     END IF;
END LOOP;
RETURN QUERY EXECUTE 'SELECT * from vessel_position where posid in (' || TRIM(TRAILING ',' FROM posids) || ')';
END
$func$ LANGUAGE plpgsql;

I can't help thinking getting all the posids as a string and then selecting them all again at the very end is slowing things down. Within the IF statement, I already have access to each row I want to keep, so could potentially store them in a temp table and then return temp table at the end of the loop.

Can this function be optimised - to improve performance in particular?

[About the outdated Postgres version 9.4.5](https://www.postgresql.org/support/versioning/): `We always recommend that all users run the latest available minor release for whatever major version is in use.` — Erwin Brandstetter, May 17 '19 at 16:16

Erwin Brandstetter · Accepted Answer · 2019-05-17T17:01:41.570

2

Query

Your function has all kinds of expensive, unnecessary overhead. A single query should be many times faster, doing the same:

CREATE OR REPLACE FUNCTION _get_vessel_track_with_interval
 (mmsi int, starttime timestamptz, endtime timestamptz, min_interval interval)
  RETURNS SETOF vessel_position AS
$func$
BEGIN
   SELECT (vp).*  -- parentheses required for decomposing row type
   FROM  (
      SELECT vp   -- whole row (!)
           , timestamp - lag(timestamp) OVER (ORDER BY posid) AS diff
      FROM   vessel_position vp
      WHERE  vp.mmsi = $1
      AND    vp.timestamp >= $2     -- typically you'd include the lower bound
      AND    vp.timestamp <  $3;    -- ... and exlude the upper
      ORDER  BY posid
      ) sub
   WHERE  diff >= $4;
END
$func$  LANGUAGE plpgsql STABLE;

Could also just be an SQL function or the bare SELECT without any wrapper (Maybe as prepared statement? Example.)

Note how starttime and endtime are passed as timestamp. (Makes no sense to pass as text and cast.) And the minimum interval min_interval is an actual interval. Pass any interval of your choosing.

Index

If the predicate on mmsi is in any way selective, the two indexes you currently have (PK ON (posid, mmsi) and idx on (timestamp)) are not very useful. If you reverse the column order of your PK to (mmsi, posid), it becomes far more useful for the query at hand. See:

Is a composite index also good for queries on the first field?

The optimal index for this would typically be on vessel_position(mmsi, timestamp). Related:

Aside: Avoid keywords as identifiers. That's asking for trouble. Plus, a column timestamp that actually holds timestamptz is misleading.

edited May 17 '19 at 17:01

answered May 16 '19 at 22:53

Erwin Brandstetter

605,456
145
1,078
1,228

@wildplasser: nope :) I select *the whole row* in the subquery, and decompose it in the outer `SELECT` to match the return type declared as `SETOF vessel_position` without spelling out all columns. (To get rid of the column `diff` added to the row in the subquery.) – Erwin Brandstetter May 16 '19 at 23:16
I see. (kind of) The `sub` is only syntactic sugar. – wildplasser May 16 '19 at 23:34
Thanks for the answer, I appreciate it does answer the question I initially asked (happy to accept it as answered), but I realised I missed out one small, but important detail (sorry about that!) which I have now added in the edit above. I have implemented some of your changes i.e passing a timestamp rather than char, and formatting the WHERE differently so hopefully that has sped it up a bit. – BStone May 17 '19 at 09:30
@BStone: I rolled back your edits *changing* the question (but kept the useful additions). You can find everything in the [edit history](https://stackoverflow.com/posts/56172750/revisions). I suggest you start a new question for your changed objective. (I might have a couple of ideas.) You can always link to this one for context - and drop a comment here to link back and get my attention. – Erwin Brandstetter May 17 '19 at 16:20
Also note the change from `timestamp` -> `timestamptz` after you disclosed the table definition. Else the current time zone setting can change results. It is important to understand implications: https://stackoverflow.com/a/9576170/939860 – Erwin Brandstetter May 17 '19 at 16:56
I see you did not post your other question. I think I have a solution for that, too. – Erwin Brandstetter May 23 '19 at 22:39

Optimising function which extracts records with a minimum gap in timestamps

1 Answers1

Query

Index