5

Excuse me if this is confusing, as I am not very familiar with postgresql. I have a postgres database with a table full of "sites". Each site reports about once an hour, and when it reports, it makes an entry in this table, like so:

site |      tstamp
-----+--------------------
6000 | 2013-05-09 11:53:04
6444 | 2013-05-09 12:58:00
6444 | 2013-05-09 13:01:08
6000 | 2013-05-09 13:01:32
6000 | 2013-05-09 14:05:06
6444 | 2013-05-09 14:06:25
6444 | 2013-05-09 14:59:58
6000 | 2013-05-09 19:00:07

As you can see, the time stamps are almost never on-the-nose, and sometimes there will be 2 or more within only a few minutes/seconds of each other. Furthermore, some sites won't report for hours at a time (on occasion). I want to only select one entry per site, per hour (as close to each hour as I can get). How can I go about doing this in an efficient way? I also will need to extend this to other time frames (like one entry per site per day -- as close to midnight as possible).

Thank you for any and all suggestions.

BLuFeNiX
  • 2,496
  • 2
  • 20
  • 40

3 Answers3

10

You could use DISTINCT ON:

select distinct on (date_trunc('hour', tstamp)) site, tstamp
from t
order by date_trunc('hour', tstamp), tstamp

Be careful with the ORDER BY if you care about which entry you get.

Alternatively, you could use the row_number window function to mark the rows of interest and then peel off the first result in each group from a derived table:

select site, tstamp
from (
    select site, tstamp,
           row_number() over (partition by date_trunc('hour', tstamp) order by tstamp) as r
    from t
) as dt
where r = 1

Again, you'd adjust the ORDER BY to select the specific row of interest for each date.

mu is too short
  • 426,620
  • 70
  • 833
  • 800
  • 1
    @BLuFeNiX: This one! And for hours or other time intervals use [`date_trunc('hour', tstamp)`](http://www.postgresql.org/docs/current/interactive/functions-datetime.html#FUNCTIONS-DATETIME-TRUNC). [Here](http://stackoverflow.com/questions/12623358/group-by-data-intervals/12624551#12624551) and [here](http://stackoverflow.com/questions/12730072/truncate-timestamp-to-arbitrary-intervals/12731753#12731753) are a related answer for arbitrary intervals. – Erwin Brandstetter May 09 '13 at 21:31
  • @ErwinBrandstetter: How could I have confused "hour" with "date"? Odd how the brain, fingers, and eyes sometimes live in different worlds. – mu is too short May 09 '13 at 21:37
  • @muistooshort: Thanks! Your first query works, but is very slow. Any suggestions on what to do to make it faster? I get an error on row_number() in the second query. Also, this query will eventually be paged, will that matter? – BLuFeNiX May 10 '13 at 16:50
  • @BLuFeNiX: Which version of PostgreSQL do you have? An index on `date_trunc('hour', stamp)` might help but indexing on function values (and window functions) depend on the PostgreSQL version. – mu is too short May 10 '13 at 17:26
  • @muistooshort: My client is 9.2.4, server is 9.2.2, do I need something else? Thanks for the help. – BLuFeNiX May 10 '13 at 17:55
  • Then you can index on a function call result and see if that speeds things up. And the `row_number` window function should be fine with that version, what is the error message? – mu is too short May 10 '13 at 18:05
  • @muistooshort: ERROR: syntax error at or near "row_number" LINE 4: row_number() over (partition by site, to_char... – BLuFeNiX May 10 '13 at 19:23
  • You have a typo in your SQL, probably a missing comma. Also note that you've switched to `to_char` where I'm using `date_trunc` so... – mu is too short May 10 '13 at 19:26
  • @muistooshort: Opps sorry, that was the wrong output. Your query seems to work fine now that I've tried it again. It's very slow, I'll try adding that index. Thank you! Also, is there any way to retain the original time stamp value? I simplified the problem in order to post it here, but I also have other (non-unique) columns I need to return and further manipulate. I need to make sure I actually display 11:58:26, not 12:00:00. – BLuFeNiX May 10 '13 at 19:43
  • The queries should work if you include other columns. PS: you don't have to `@` me in comments on my answers unless there are other people in the thread and you're specifically talking to me, I'll get notified about all comments on my stuff. – mu is too short May 10 '13 at 20:15
  • Okay, thanks. But how do I go about retaining the real timestamp information? – BLuFeNiX May 10 '13 at 20:23
  • I don't understand the problem, the real timestamp is right there in `tstamp`. – mu is too short May 10 '13 at 21:05
  • Since you are truncating the date, wouldn't this only pick up 12:01 as 12:00. and not 11:59 – Mike Furlender Nov 23 '17 at 18:58
5

You are looking for the closest value per hour. Some are before the hour and some are after. That makes this a hardish problem.

First, we need to identify the range of values that work for a particular hour. For this, I'll consider anything from 15 minutes before the hour to 45 minutes after as being for that hour. So, the period of consideration for 2:00 goes from 1:45 to 2:45 (arbitrary, but seems reasonable for your data). We can do this by shifting the time stamps by 15 minutes.

Second, we need to get the closest value to the hour. So, we prefer 1:57 to 2:05. We can do this by considering the first value in (57, 60 - 57, 5, 60 - 5).

We can put these rules into a SQL statement, using row_number():

select site, tstamp, usedTimestamp
from (select site, tstamp,
             date_trunc('hour', tstamp + 'time 00:15') as usedTimestamp
             row_number() over (partition by site, to_char(tstamp + time '00:15', 'YYYY-MM-DD-HH24'),
                                order by least(extract(minute from tstamp), 60 - extract(minute from tstamp))
                               ) as seqnum
      from t
     ) as dt
where seqnum = 1;
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
1

For the extensibility aspect of your question.

I also will need to extend this to other time frames (like one entry per site per day

From the distinct set of site ids, and using a (recursive) CTE, I would build a set comprised of one entry per site per hour (or other specified interval), within a specified StartDateTime, EndDateTime range.

          SITE..THE DATE-TIME-HOUR
          6000  12.1.2013 00:00:00
          6000  12.1.2013 01:00:00
          .
          .
          . 
          6000  12.1.2013 24:00:00              
          7000  12.1.2013 00:00:00        
          7000  12.1.2013 01:00:00
          .
          .
          . 
          7000  12.1.2013 24:00:00

Then I would left join that CTE against your SITES log on site id and on the min absolute difference between the CTE point-in-time and the LOG's point-in-time.

That way you are assured of a row for each site per interval.

P.S. For a site that has not phoned home for a long time, its most recent phone-in timestamp will be repeated multiple times as the closest one available.

Tim
  • 8,669
  • 31
  • 105
  • 183