Correct way to eliminate duplicates from a window query

Question

I am trying to create a version history for a software installation based on diagnostic data received on certain dates. The data is in a PostgreSQL database:

SELECT version();
                                                version
-------------------------------------------------------------------------------------------------------
 PostgreSQL 10.14 on x86_64-pc-linux-gnu, compiled by x86_64-unknown-linux-gnu-gcc (GCC) 4.9.4, 64-bit

The table's schema is like this:

CREATE TABLE cluster_info (
  cluster_id uuid,
  date timestamp,
  version text,
  PRIMARY KEY (cluster_id, date)
);

The relevant data looks like this:

select cluster_id, version, date 
from cluster_info 
where cluster_id = 'e2865aec-0ce1-11ec-afda-0242c0a8a003' 
order by date;

              cluster_id              |        date         | version
--------------------------------------+---------------------+--------------
 e2865aec-0ce1-11ec-afda-0242c0a8a003 | 2019-03-15 10:30:47 | 6.0.5
 e2865aec-0ce1-11ec-afda-0242c0a8a003 | 2019-05-03 20:32:33 | 6.0.5
 e2865aec-0ce1-11ec-afda-0242c0a8a003 | 2019-05-08 14:57:05 | 6.0.7
 e2865aec-0ce1-11ec-afda-0242c0a8a003 | 2019-05-20 16:59:45 | 6.0.7
 e2865aec-0ce1-11ec-afda-0242c0a8a003 | 2019-05-21 00:21:43 | 6.0.5, 6.0.7
 e2865aec-0ce1-11ec-afda-0242c0a8a003 | 2019-05-21 18:45:45 | 6.0.5, 6.0.7
 e2865aec-0ce1-11ec-afda-0242c0a8a003 | 2019-05-22 20:05:10 | 6.0.5, 6.0.6
 e2865aec-0ce1-11ec-afda-0242c0a8a003 | 2019-05-23 11:54:39 | 6.0.5, 6.0.6
 e2865aec-0ce1-11ec-afda-0242c0a8a003 | 2019-05-24 15:01:09 | 6.0.7
 e2865aec-0ce1-11ec-afda-0242c0a8a003 | 2019-05-24 19:21:14 | 6.0.7
 e2865aec-0ce1-11ec-afda-0242c0a8a003 | 2019-05-28 20:06:29 | 6.0.6
 e2865aec-0ce1-11ec-afda-0242c0a8a003 | 2019-07-09 05:20:32 | 6.0.6
 e2865aec-0ce1-11ec-afda-0242c0a8a003 | 2019-07-11 12:05:03 | 6.0.8
 e2865aec-0ce1-11ec-afda-0242c0a8a003 | 2019-07-17 17:46:10 | 6.0.8
 e2865aec-0ce1-11ec-afda-0242c0a8a003 | 2019-07-24 14:44:55 | 6.0.6
 e2865aec-0ce1-11ec-afda-0242c0a8a003 | 2019-07-26 14:54:33 | 6.0.6

My first instinct would be to use min and max with group by, but it's possible that a cluster could be downgraded to a previous version after being upgraded. In that case I would want to show separate timespans for each period when the cluster was on a given version, and group by wouldn't accomplish this.

I attempted to use min and max window functions partitioned by version, which did not work as I expected either:

select distinct * from (select 
   version, 
   min(date) over (partition by version), 
   max(date) over (partition by version) 
from cluster_info 
where cluster_id = 'e2865aec-0ce1-11ec-afda-0242c0a8a003' 
order by date) x;

 version      |         min         |         max
--------------+---------------------+---------------------
 6.0.5        | 2019-03-15 10:30:47 | 2019-05-03 20:32:33
 6.0.5, 6.0.6 | 2019-05-22 20:05:10 | 2019-05-23 11:54:39
 6.0.5, 6.0.7 | 2019-05-21 00:21:43 | 2019-05-21 18:45:45
 6.0.6        | 2019-05-28 20:06:29 | 2019-07-26 14:54:33
 6.0.7        | 2019-05-08 14:57:05 | 2019-05-24 19:21:14
 6.0.8        | 2019-07-11 12:05:03 | 2019-07-17 17:46:10

What is the correct way to do this?

Edit: updated to include version and schema, and use an example dataset that exhibits the downgrade problem and shows that my initial solution was incorrect.

Your `date` is obviously type `timestamp`, not `date`. And your `cluster_id` seems to be `uuid`? Please start with a table definition (`CREATE TABLE ...`) showing data types and constraints. And always disclose your version of Postgres (`SELECT version()`). — Erwin Brandstetter, Sep 10 '21 at 23:19
`it's possible that a cluster could be downgraded to a previous version after being upgraded.` Then please provide sample data accordingly. The current sample does not reveal the problematic case. — Erwin Brandstetter, Sep 10 '21 at 23:27
Sorry for the useless example. I have edited the question to have a relevant example and include the other missing information you requested. I tried your solution and I confirmed that it works as expected on this dataset. — J.B. Langston, Sep 13 '21 at 16:06

score 1 · Accepted Answer · answered Sep 10 '21 at 23:48

You need more sophistication if version downgrades (or NULL values?) are possible:

SELECT min(version) AS version, min(date), max(date)
FROM  (
   SELECT version, date
        , count(*) FILTER (WHERE step IS NOT FALSE) OVER (ORDER BY date) AS grp
   FROM  (
      SELECT version, date
           , lag(version) OVER (ORDER BY date) <> version AS step
      FROM   cluster_info
      WHERE  cluster_id = '0f4ce21e-0d08-11ec-b209-0242c0a8c004'
      ORDER  BY date
      ) sub1
   ) sub2
GROUP  BY grp;

db<>fiddle here (sample data extended with version downgrade and unknown version)

See (with detailed explanation and links to more):

Correct way to eliminate duplicates from a window query

1 Answers1