Postgres: How to find nearest tsrange from timestamp outside of ranges?

Question

I am modeling (in Postgres 9.6.1 / postGIS 2.3.1) a booking system for local services provided by suppliers:

create table supplier (
    id                serial primary key,
    name              text not null check (char_length(title) < 280),
    type              service_type,
    duration          interval,
    ...
    geo_position      geography(POINT,4326)
    ...
);

Each supplier keeps a calendar with time slots when he/she is available to be booked:

create table timeslot (
    id                 serial primary key,
    supplier_id        integer not null references supplier(id),
    slot               tstzrange not null,

    constraint supplier_overlapping_timeslot_not_allowed
    exclude using gist (supplier_id with =, slot with &&)
);

For when a client wants to know which nearby suppliers are available to book at a certain time, I create a view and a function:

create view supplier_slots as
    select
        supplier.name, supplier.type, supplier.geo_position, supplier.duration, ...
        timeslot.slot
    from
        supplier, timeslot
    where
        supplier.id = timeslot.supplier_id;


create function find_suppliers(wantedType service_type, near_latitude text, near_longitude text, at_time timestamptz)
returns setof supplier_slots as $$
declare
    nearpoint geography;
begin
    nearpoint := ST_GeographyFromText('SRID=4326;POINT(' || near_latitude || ' ' || near_longitude || ')');
    return query
        select * from supplier_slots
        where type = wantedType
            and tstzrange(at_time, at_time + duration) <@ slot
        order by ST_Distance( nearpoint, geo_position )
        limit 100;
end;
$$ language plpgsql;

All this works really well.

Now, for the suppliers that did NOT have a bookable time slot at the requested time, I would like to find their closest available timeslots, before and after the requested at_time, also sorted by distance.

This has my mind spinning a little bit and I can't find any suitable operators to give me the nearest tsrange.

Any ideas on the smartest way to do this?

Do you want a total of 100 suppliers (with and without available time slot) Or a total of 100 timeslots? Or 200? The definition needs to be clear, it can result in a different solution. — Erwin Brandstetter, Dec 18 '16 at 15:12
Very valid question. This probably makes the most sense: Out of the X closest suppliers who offer the correct `service_type`, return those who have a `timeslot` either before or after `at_time` (or both) ordered by time interval between the closest `timeslot` and `at_time`. — Jesper We, Dec 18 '16 at 15:41

score 2 · Accepted Answer · edited May 23 '17 at 12:25

The solution depends on the exact definition of what you want.

Schema

I suggest these slightly adapted table definitions to make the task simpler, enforce integrity and improve performance:

CREATE TABLE supplier (
   supplier_id  serial PRIMARY KEY,
   supplier     text NOT NULL CHECK (length(title) < 280),
   type         service_type,
   duration     interval,
   geo_position geography(POINT,4326)
);

CREATE TABLE timeslot (
   timeslot_id  serial PRIMARY KEY,
   supplier_id  integer NOT NULL -- references supplier(id),
   slot_a       timestamptz NOT NULL,
   slot_z       timestamptz NOT NULL,
   CONSTRAINT   timeslot_range_valid CHECK (slot_a < slot_z)
   CONSTRAINT   timeslot_no_overlapping
     EXCLUDE USING gist (supplier_id WITH =, tstzrange(slot_a, slot_z) WITH &&)
);

CREATE INDEX timeslot_slot_z ON timeslot (supplier_id, slot_z);
CREATE INDEX supplier_geo_position_gist ON supplier USING gist (geo_position);

Save two timestamptz columns slot_a and slot_z instead of the tstzrange column slot - and adapt constraints accordingly. This treats all ranges as default inclusive lower and exclusive upper bounds automatically now - which avoids corner case errors / headache.

Collateral benefit: only 16 bytes for 2 timestamptz instead of 25 bytes (32 with padding) for the tstzrange.
All queries you might have had on slot keep working with tstzrange(slot_a, slot_z) as drop-in replacement.
Add an index on (supplier_id, slot_z) for the query at hand.
And a spatial index on supplier.geo_position (which you probably have already).

Depending on data distribution in type, a couple of partial indexes for types common in queries might help performance:
```
CREATE INDEX supplier_geo_type_foo_gist ON supplier USING gist (geo_position)
WHERE supplier = 'foo'::service_type;
```

Query / Function

This query finds the X closest suppliers who offer the correct service_type (100 in the example), each with the one closest matching time slot (defined by the time distance to the start of the slot). I combined this with actually matching slots, which may or may not be what you need.

CREATE FUNCTION f_suppliers_nearby(_type service_type, _lat text, _lon text, at_time timestamptz)
  RETURNS TABLE (supplier_id  int
               , name         text
               , duration     interval
               , geo_position geography(POINT,4326)
               , distance     float 
               , timeslot_id  int
               , slot_a       timestamptz
               , slot_z       timestamptz
               , time_dist    interval
   ) AS
$func$
   WITH sup_nearby AS (  -- find matching or later slot
      SELECT s.id, s.name, s.duration, s.geo_position
           , ST_Distance(ST_GeographyFromText('SRID=4326;POINT(' || _lat || ' ' || _lon || ')')
                          , geo_position) AS distance
           , t.timeslot_id, t.slot_a, t.slot_z
           , CASE WHEN t.slot_a IS NOT NULL
                  THEN GREATEST(t.slot_a - at_time, interval '0') END AS time_dist
      FROM   supplier s
      LEFT   JOIN LATERAL (
         SELECT *
         FROM   timeslot
         WHERE  supplier_id = supplier_id
         AND    slot_z > at_time + s.duration  -- excl. upper bound
         ORDER  BY slot_z
         LIMIT  1
         ) t ON true
      WHERE  s.type = _type
      ORDER  BY s.distance
      LIMIT  100
      )
   SELECT *
   FROM  (
      SELECT DISTINCT ON (supplier_id) *  -- 1 slot per supplier
      FROM  (
         TABLE sup_nearby  -- matching or later slot

         UNION ALL         -- earlier slot
         SELECT s.id, s.name, s.duration, s.geo_position
              , s.distance
              , t.timeslot_id, t.slot_a, t.slot_z
              , GREATEST(at_time - t.slot_a, interval '0') AS time_dist
         FROM   sup_nearby s
         CROSS  JOIN LATERAL (  -- this time CROSS JOIN!
            SELECT *
            FROM   timeslot
            WHERE  supplier_id = s.supplier_id
            AND    slot_z <= at_time  -- excl. upper bound
            ORDER  BY slot_z DESC
            LIMIT  1
            ) t
         WHERE  s.time_dist IS DISTINCT FROM interval '0'  -- exact matches are done
         ) sub
      ORDER  BY supplier_id, time_dist  -- pick temporally closest slot per supplier
   ) sub
   ORDER  BY time_dist, distance;  -- matches first, ordered by distance; then misses, ordered by time distance

$func$  LANGUAGE sql;

I did not use your view supplier_slots and optimized for performance instead. The view may still be convenient. You might include tstzrange(slot_a, slot_z) AS slot for backward compatibility.

The basic query to find the 100 closest suppliers is a textbook "K Nearest Neighbour" problem. A GiST index works well for this. Related:

How do I query all rows within a 5-mile radius of my coordinates?

The additional task (find the temporally nearest slot) can be split in two tasks: to find the next higher and the next lower row. The core feature of the solution is to have two subqueries with ORDER BY slot_z LIMIT 1 and ORDER BY slot_z DESC LIMIT 1, which result in two very fast index scans.

I combined the first one with finding actual matches, which is a (smart, I think) optimization, but may distract from the actual solution.

Quite a mouthful! Let me digest that for a bit. At least I am glad I didn't ask a stupid question with an obvious answer ;-) — Jesper We, Dec 18 '16 at 19:00
@JesperWe: I did not actually test the function. But it should be as fast as it gets, with only index scans. — Erwin Brandstetter, Dec 18 '16 at 21:27
Doesn't matter, the typos where easily fixed. The solution doesn't really do the job (you change it to find both available and near slots in the same query, which I like because that's what I thought about doing in the first place, but I decided to have two separate queries. If you do it in one query you have to take into account the situations like when there is only partial overlap between time_at + duration and the timeslot, and so on. But anyways, you pointed me the way to go, now I go :-) — Jesper We, Dec 19 '16 at 13:24
@JesperWe: Cool. Partial indexes may be very useful to optimize performance. I added some more explanation. If there are still typos in the code, feel free to fix. — Erwin Brandstetter, Dec 19 '16 at 14:45

Postgres: How to find nearest tsrange from timestamp outside of ranges?

1 Answers1

Schema

Query / Function

Linked