0

I have a table with some statistical values from a process. The table has the following format:

CREATE TABLE data (
    process    integer   NOT NULL,
    time       timestamp NOT NULL
    first      double precision,
    last       double precision,
    first_time timestamp,
    last_time  timestamp
)

The data in this table is inserted every minute, and contains the aggregate value of the last minute. For example, for a process 1, we can have the following data:

+---------------------------------------------------------------------------------+
|             process | time | first | last  | first_time | last_time             |
+---------------------------------------------------------------------------------+
| 1 | 2014-09-22 12:00:00 | 100 | 200 | 2014-09-22 12:00:00 | 2014-09-22 12:00:59 |
| 1 | 2014-09-22 12:01:00 | 104 | 152 | 2014-09-22 12:01:00 | 2014-09-22 12:01:59 |
| 1 | 2014-09-22 12:02:00 | 141 | 155 | 2014-09-22 12:02:10 | 2014-09-22 12:02:59 |
| 1 | 2014-09-22 12:03:00 | 122 | 147 | 2014-09-22 12:03:00 | 2014-09-22 12:02:45 |
+---------------------------------------------------------------------------------+

As you can see in the third row, there are times when the first value is not the second 0 of the minute. In the last time this also happen (fourth row).

Using the first and last functions from this page, and the date_round function from this page, I want to select the first value of a given process in a 30 minute interval.

When I try the following two queries, both the result are correct.

SELECT
    date_round(time, '30 min'::interval) AS "time",
    first(first)
FROM
    data
WHERE
    process = 1
    AND
    time > '2014-09-20 00:00:00'
    AND
    time < '2014-09-22 18:00:00'
GROUP BY 1
ORDER BY 1

and

SELECT
    date_round(time, '30 min'::interval) AS "time",
    first(first) AS "value"
FROM (
    SELECT
        time,
        first
    FROM
        data
    WHERE
        process = 1
        AND
        time > '2014-09-20 00:00:00'
        AND
        time < '2014-09-22 18:00:00'
    ORDER BY 1
) A
GROUP BY 1
ORDER BY 1

My question is: Why the first query works in this case? Is postgres sort the data before apply the group by clause?

My doubt is because the "first" function returns the first value of the two passed arguments. If the data is not sorted, then the first call in the "first" aggregate function can contain a value that is not in the correct order, resulting in an invalid value, right?

Can I use the first query safely or should I use the second query?

Community
  • 1
  • 1
cmac
  • 38
  • 5

1 Answers1

2

This is the query you really want:

SELECT distinct on (date_round(time, '30 min'::interval))
       date_round(time, '30 min'::interval) AS "time",
       first
FROM data
WHERE process = 1 AND
      time > '2014-09-20 00:00:00' AND
      time < '2014-09-22 18:00:00'
ORDER BY date_round(time, '30 min'::interval), time;

Postgres doesn't do the ordering before the group by -- no database does that by the definition of SQL. It happens to encounter the earliest record first, but there is no guarantee. In fact, I don't think the second version is guaranteed to be ordered either (although I cannot find explicit Postgres documentation on this point).

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786