2

I have data that looks like this

Name     |  Date                     |  Event     |  Event_ID
_____________________________________________________________
BRADLEY  |  2014-12-01 16:15:26.442  |  ACCESSED  |  268766  
BRADLEY  |  2014-12-01 16:15:36.794  |  ACCESSED  |  268766  
BRADLEY  |  2014-12-01 16:15:50.618  |  DENIED    |  268766  
BRADLEY  |  2014-12-01 16:16:04.89   |  DENIED    |  268766  
BRADLEY  |  2014-12-01 16:18:01.036  |  DENIED    |  268766  
BRADLEY  |  2014-12-01 16:18:31.335  |  DENIED    |  268766  
CHARLES  |  2014-12-01 08:33:34.831  |  ACCESSED  |  445317  
CHARLES  |  2014-12-01 08:33:44.041  |  ACCESSED  |  445317  
CHARLES  |  2014-12-01 14:56:49.872  |  ACCESSED  |  10333360
CHARLES  |  2014-12-01 14:56:57.549  |  ACCESSED  |  10333360
CHARLES  |  2014-12-01 14:56:59.248  |  ACCESSED  |  10333360
CHARLES  |  2014-12-01 14:56:62.221  |  ACCESSED  |  10333360
CHARLES  |  2014-12-01 14:56:63.226  |  ACCESSED  |  10333360

My requirement is that I need to remove events that are ACCESSED that are within 15 minutes apart. For example BRADLEY, I would remove the second accessed at timestamp 16:15:36.794. That part is easy for me as I can just logically join the same table together comparing current row to next row and do logic on the Date.

Now the issue I'm coming across is CHARLES. His Event_ID of 10333360 is a bit more complicated than BRADLEY's use case. For CHARLES, I will need to remove all ACCESSED with Event_ID of 10333360 except the one with Date of 14:56:49.872. That's because I need to remove all dates that are within 15 minutes at the start of a new Event_ID. The real world issue is that there are too much "duplicates" when a user is ACCESSED and I'm doing data cleanup to remove all these unnecessary ACCESSED data.

I thought about using window functions in Postgres but there doesn't seem to be anything that can help me with the logic in comparing the dates (http://www.postgresql.org/docs/9.1/static/functions-window.html)

I do have some ideas on how to tackle this problem using stored procedures and create temp tables so that I can actually use variables in a Java-like way. But of course I want it to be efficient and I'm hoping to learn new techniques on how to tackle a problem like this.

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
cYn
  • 3,291
  • 6
  • 25
  • 43

1 Answers1

3

This is tricky, because which rows are deleted and which rows stay depend on the rest of the table. So you have a moving target. To pin the chicken down, I suggest you apply a grid (of 15 minutes in your case):

SELECT tbl_id, row_number() OVER (PARTITION BY grid_start, event_id, name
                                  ORDER BY date) rn
FROM  (
   SELECT g AS grid_start, g + interval '15 min' AS grid_end
   FROM  (SELECT min(date) AS mind, max(date) AS maxd
          FROM   tbl
          WHERE  event = 'ACCESSED') t
        , generate_series(t.mind, t.maxd, interval '15 min') g
   ) g
JOIN   tbl t ON t.date >= g.grid_start
            AND t.date <  g.grid_end
WHERE  event = 'ACCESSED';

The implicit LATERAL join requires Postgres 9.3+.

To substitute in older versions:

   FROM  (SELECT generate_series(min(date)
                               , max(date)
                               , interval '15 min') g
          FROM tbl WHERE event = 'ACCESSED') g

Now it's simple to DELETE:

DELETE FROM tbl t
USING (<above query>) x
WHERE  t.tbl_id = x.tbl_id
AND    x.rn > 1;

Only the first row per name in every 15 min interval survives.
Note that you can still have two rows within 15 minutes (neighboring grid cells), but never three. Or generally speaking: never more than n+1 rows per n consecutive intervals.

If that's not good enough I suggest a procedural solution. Iterate through qualifying rows, remember the date of the first survivor and then return the id (for deletion) of every following row until the date is > 15 minutes later. Remember the date of that next survivor etc. Similar to:

Aside: don't call a timestamp "date", that's misleading.

Community
  • 1
  • 1
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • Wow, this is some kind of a solution. Currently testing it but one problem, is there a syntax error at 'generate_series(mind, maxd, interval '15 min') g'? – cYn Dec 23 '14 at 19:45
  • @cYn: You need Postgres 9.3 or later. since you did not specify I naturally assumed the *current* the version of Postgres. – Erwin Brandstetter Dec 23 '14 at 19:53
  • Ahh wait, never mind. That generate_series is being joined with the nested selected statement above it. The two 'g' alias got me confused for a bit. – cYn Dec 23 '14 at 19:55
  • @cYn: I added the missing event_id for the PARTITION and a variant for older versions without LATERAL. – Erwin Brandstetter Dec 23 '14 at 20:40
  • 1
    Simply amazing. Your code works. Now I just need to comb it through so I can understand it and throw it in my tool box. Thank you for this. – cYn Dec 23 '14 at 21:02