
Suppose I have a table with 3 columns:

  • id (PK, int)
  • timestamp (datetime)
  • title (text)

I have the following records:

1, 2010-01-01 15:00:00, Some Title
2, 2010-01-01 15:00:02, Some Title
3, 2010-01-02 15:00:00, Some Title

I need to do a GROUP BY on records that are within 3 seconds of each other. For this table, rows 1 and 2 would be grouped together.

There is a similar question here: Mysql DateTime group by 15 mins

I also found this: http://www.artfulsoftware.com/infotree/queries.php#106

I don't know how to convert these methods into something that will work for seconds. The trouble with the method on the SO question is that it seems to me that it would only work for records falling within a bin of time that starts at a known point. For instance, if I were to get FLOOR() to work with seconds, at an interval of 5 seconds, a time of 15:00:04 would be grouped with 15:00:01, but not grouped with 15:00:06.
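
For illustration, the fixed-bin approach I mean would look something like this (a sketch, assuming a hypothetical logs table shaped like the one above):

-- Fixed 5-second bins anchored to the Unix epoch: 15:00:04 lands in the
-- [15:00:00, 15:00:05) bin alongside 15:00:01, while 15:00:06 starts a new bin.
SELECT FLOOR(UNIX_TIMESTAMP(`timestamp`) / 5) AS bin, COUNT(*) AS cnt
FROM logs
GROUP BY bin;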

Does this make sense? Please let me know if further clarification is needed.

EDIT: For the set of numbers {1, 2, 3, 4, 5, 6, 7, 50, 51, 60}, it seems it might be best to group them {1, 2, 3, 4, 5, 6, 7}, {50, 51}, {60}, so that each row's group membership depends on whether the row is within 3 seconds of the previous one. I know this changes things a bit; I'm sorry for being wishy-washy on this.

I am trying to fuzzy-match logs from different servers. Server #1 may log an item, "Item #1", and Server #2 will log that same item, "Item #1", within a few seconds of server #1. I need to do some aggregate functions on both log lines. Unfortunately, I only have title to go on, due to the nature of the server software.

Brad
  • Still ambiguous. If the seconds were 1,2,3,4,5,6 then there are many possible groupings of 3 seconds where any given row could be in multiple groups... – Randy Jul 01 '11 at 17:29
  • @Randy, excellent point. Clarifying... – Brad Jul 01 '11 at 17:29
  • The reason that you need to pick distinct times is that otherwise grouping "within 3 seconds" probably doesn't make much sense. What if you had a series of 30 rows each exactly 2 seconds after the previous one? Now the first and last are almost a minute apart, but would be grouped because of the "chain" of rows. – Tom H Jul 01 '11 at 17:30
  • Do you want all records within 3 seconds of each other to be grouped together? I.e., 0:01, 0:02, 0:03, 0:04 group to one group? – dfb Jul 01 '11 at 17:30
  • @Tom H., @spinning_plate, good points, please see my last edit, which should give better insight into what I am trying to accomplish. – Brad Jul 01 '11 at 17:34
  • In Oracle, I would probably approach this with a LAG function: check each row to see if it matches the previous row within 3 seconds; if it does, set the 'group' to that previous row's timestamp, if not, use the current row's timestamp. – Randy Jul 01 '11 at 17:38
  • @Randy, that sounds like an excellent idea. Do you know if this can be implemented or mimicked in MySQL? – Brad Jul 01 '11 at 17:42
  • @Randy: MySQL's never had analytic functions, sadly. – OMG Ponies Jul 01 '11 at 17:42
  • I'm sure that someone will prove me wrong, but I'm having a hard-time coming up with a set-based solution given that the "first" row in each group relies on determining groups for previous rows. You may need to approach this in a sequential way (cursors, etc.). If I can think of something though, then I'll post it. – Tom H Jul 01 '11 at 17:44
  • http://onlamp.com/pub/a/mysql/2007/04/12/emulating-analytic-aka-ranking-functions-with-mysql.html?page=2 I'm digging into this now and will post back with what I find. – Brad Jul 01 '11 at 17:45
  • Please clarify what you expect when you have records where the date is the same to the minute, but seconds are: 1,2,3, and 4? Any aggregate function will be double counting, because 1,2,3 are within three seconds... but so are 2,3,4 and so on. – OMG Ponies Jul 01 '11 at 17:54
  • @OMG Ponies, for this application, I would expect seconds 1, 2, 3, 4 to be grouped in the same group. Seconds 8, 9, 10, for example, would be in the next group. It is acceptable that as long as the title matches, and it is within 3 seconds of the last row with the matching title, that it appears in the same group, regardless of how long this goes on. For my purposes, it is acceptable for logs with the same title to be at seconds 1 through 30 for example, and these would all be in the same group. – Brad Jul 01 '11 at 17:59
  • I think I can achieve this using `LAG()` as suggested by Randy. I just need to figure out how to do it in MySQL. I found a resource (http://explainextended.com/2009/03/10/analytic-functions-first_value-last_value-lead-lag/) that seems to do exactly what is needed... now I just need to figure it all out. – Brad Jul 01 '11 at 18:01
  • If this is for a one-off operation, consider loading the data into a database that supports analytics (SQL Server 2005+ Express is very close to MySQL for data types). Otherwise, read: http://explainextended.com/2009/03/12/analytic-functions-optimizing-lag-lead-first_value-last_value/ – OMG Ponies Jul 01 '11 at 18:03
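
For reference: MySQL 8.0 (which postdates this thread) added window functions, so the LAG() approach Randy describes above can now be written directly. A sketch, assuming a hypothetical logs table shaped like the one in the question; for the fuzzy-matching use case, add PARTITION BY title to both OVER clauses:

-- Flag a row as a chain start when it is more than 3 seconds after the
-- previous row (LAG() is NULL for the first row, so the CASE falls through
-- to 1), then number the groups with a running SUM of the flags.
SELECT id, `timestamp`, title,
       SUM(is_chain_start) OVER (ORDER BY `timestamp`, id) AS grp
FROM (
    SELECT id, `timestamp`, title,
           CASE WHEN TIMESTAMPDIFF(SECOND,
                         LAG(`timestamp`) OVER (ORDER BY `timestamp`, id),
                         `timestamp`) <= 3
                THEN 0 ELSE 1 END AS is_chain_start
    FROM logs
) t;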

5 Answers


I'm using Tom H.'s excellent idea but doing it a little differently here:

Instead of finding all the rows that are the beginnings of chains, we can find all times that are the beginnings of chains, then go back and find the rows that match those times.

Query #1 here should tell you which times are the beginnings of chains, by finding the times that have no other times within the 3 seconds before them:

SELECT DISTINCT a.Timestamp
FROM Table a
LEFT JOIN Table b
ON (b.Timestamp >= a.Timestamp - INTERVAL 3 SECOND
    AND b.Timestamp < a.Timestamp)
WHERE b.Timestamp IS NULL

And then for each row, we can find the largest chain-starting timestamp that is less than or equal to our timestamp with Query #2:

SELECT Table.id, MAX(StartOfChains.Timestamp) AS ChainStartTime
FROM Table
JOIN ([query #1]) StartOfChains
ON Table.Timestamp >= StartOfChains.Timestamp
GROUP BY Table.id

Once we have that, we can GROUP BY it as you wanted.

SELECT COUNT(*) -- or whatever
FROM Table
JOIN ([query #2]) GroupingQuery
ON Table.id = GroupingQuery.id
GROUP BY GroupingQuery.ChainStartTime
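
Assembled into a single statement, the three pieces above look something like this (a sketch; Table is just the placeholder name from the queries above, and since TABLE is a reserved word in MySQL a real query would need backticks or an actual table name):

SELECT GroupingQuery.ChainStartTime, COUNT(*) AS rows_in_group
FROM Table
JOIN (
    SELECT Table.id, MAX(StartOfChains.Timestamp) AS ChainStartTime
    FROM Table
    JOIN (
        SELECT DISTINCT a.Timestamp
        FROM Table a
        LEFT JOIN Table b
          ON (b.Timestamp >= a.Timestamp - INTERVAL 3 SECOND
              AND b.Timestamp < a.Timestamp)
        WHERE b.Timestamp IS NULL
    ) StartOfChains
      ON Table.Timestamp >= StartOfChains.Timestamp
    GROUP BY Table.id
) GroupingQuery
ON Table.id = GroupingQuery.id
GROUP BY GroupingQuery.ChainStartTime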

I'm not entirely sure this is distinct enough from Tom H's answer to be posted separately, but it sounded like you were having trouble with implementation, and I was thinking about it, so I thought I'd post again. Good luck!

Chris Cunningham
  • Thank you for the further clarification. I finally got it working! I made a couple minor changes. First is that with query #2, I needed a `GROUP BY Table.id`. Also in query #2, I changed `>` to `>=` to take care of timestamps that are equal to the start of the chain. I have since implemented this all in my actual code, and it works great! Thanks to you and Tom H. as well. – Brad Jul 03 '11 at 20:30
  • Good points; those changes are definitely necessary. I'll go back and put them in. Glad we were able to make it happen for you! – Chris Cunningham Jul 03 '11 at 21:49
  • Also works great on SQL Server if you change the INTERVAL line to: ON (b.Timestamp >= DATEADD(second, -3, a.TimeStamp) AND... – Ryan Barton Aug 28 '12 at 15:21
  • Awesome query. Helped immensely while trying to find datetimes within 2 seconds of each other for analytical purposes. – Josh Davis Jul 29 '14 at 20:09
  • One of the most useful snippets of SQL I've found in over 10 years. – Peter Hanneman Jun 17 '16 at 09:06

Now that I think that I understand your problem, based on your comment response to OMG Ponies, I think that I have a set-based solution. The idea is to first find the start of any chains based on the title. The start of a chain is going to be defined as any row where there is no match within three seconds prior to that row:

SELECT
    MT1.my_id,
    MT1.title,
    MT1.my_time
FROM
    My_Table MT1
LEFT OUTER JOIN My_Table MT2 ON
    MT2.title = MT1.title AND
    (
        MT2.my_time < MT1.my_time OR
        (MT2.my_time = MT1.my_time AND MT2.my_id < MT1.my_id)
    ) AND
    MT2.my_time >= MT1.my_time - INTERVAL 3 SECOND
WHERE
    MT2.my_id IS NULL

Now we can assume that any non-chain starters belong to the chain starter that appeared before them. Since MySQL doesn't support CTEs, you might want to throw the above results into a temporary table, as that would save you the multiple joins to the same subquery below.
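
A sketch of that temporary table, using the names above. One MySQL wrinkle to be aware of: a TEMPORARY table can't be referenced more than once in the same query, so to replace both SQ1 and SQ2 below you'd need either a regular scratch table or two copies of the temporary one.

-- Chain starters: rows with no same-title match in the 3 seconds before them
CREATE TEMPORARY TABLE Chain_Starters AS
SELECT
    MT1.my_id,
    MT1.title,
    MT1.my_time
FROM
    My_Table MT1
LEFT OUTER JOIN My_Table MT2 ON
    MT2.title = MT1.title AND
    (
        MT2.my_time < MT1.my_time OR
        (MT2.my_time = MT1.my_time AND MT2.my_id < MT1.my_id)
    ) AND
    MT2.my_time >= MT1.my_time - INTERVAL 3 SECOND
WHERE
    MT2.my_id IS NULL;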

SELECT
    SQ1.my_id,
    COUNT(*)  -- You didn't say what you were trying to calculate, just that you needed to group them
FROM
(
    SELECT
        MT1.my_id,
        MT1.title,
        MT1.my_time
    FROM
        My_Table MT1
    LEFT OUTER JOIN My_Table MT2 ON
        MT2.title = MT1.title AND
        (
            MT2.my_time < MT1.my_time OR
            (MT2.my_time = MT1.my_time AND MT2.my_id < MT1.my_id)
        ) AND
        MT2.my_time >= MT1.my_time - INTERVAL 3 SECOND
    WHERE
        MT2.my_id IS NULL
) SQ1
INNER JOIN My_Table MT3 ON
    MT3.title = SQ1.title AND
    MT3.my_time >= SQ1.my_time
LEFT OUTER JOIN
(
    SELECT
        MT1.my_id,
        MT1.title,
        MT1.my_time
    FROM
        My_Table MT1
    LEFT OUTER JOIN My_Table MT2 ON
        MT2.title = MT1.title AND
        (
            MT2.my_time < MT1.my_time OR
            (MT2.my_time = MT1.my_time AND MT2.my_id < MT1.my_id)
        ) AND
        MT2.my_time >= MT1.my_time - INTERVAL 3 SECOND
    WHERE
        MT2.my_id IS NULL
) SQ2 ON
    SQ2.title = SQ1.title AND
    SQ2.my_time > SQ1.my_time AND
    SQ2.my_time <= MT3.my_time
WHERE
    SQ2.my_id IS NULL

This would look much simpler if you could use CTEs or if you used a temporary table. Using the temporary table might also help performance.

Also, there will be issues with this if you can have timestamps that match exactly. If that's the case then you will need to tweak the query slightly to use a combination of the id and the timestamp to distinguish rows with matching timestamp values.

EDIT: Changed the queries to handle exact matches by timestamp.

Tom H
  • Ooh!! ooh! This is a beautiful idea! – Chris Cunningham Jul 01 '11 at 20:11
  • @Tom H., many of my timestamps do match exactly. It is very common to see 3 rows with the same title, same time, and different server IDs (and naturally, different IDs for the PK on the table). I have been scratching my head on how to modify your queries, but I can't seem to get it. Can you point me in the right direction? – Brad Jul 01 '11 at 22:09
  • I just made the change to handle cases of exact matches on the timestamps. It's late and I'm tired, but I think I got it right. – Tom H Jul 02 '11 at 04:51

Simple query:

SELECT * FROM time_history GROUP BY ROUND(UNIX_TIMESTAMP(time_stamp)/3);
Pavan Kumar N
  • Your solution would work for many cases, but not for the specific case I have. By rounding, it is possible for two entries to be within one second of each other, yet fall into different buckets. +1 anyway though, as this is a quick way to get the job mostly done. – Brad Mar 12 '13 at 15:42

Warning: long answer. This should work, and is fairly neat, except for one step in the middle where you have to be willing to run an INSERT statement over and over until it stops doing anything, since we can't do recursive CTE things in MySQL.

I'm going to use this data as the example instead of yours:

id    Timestamp
1     1:00:00
2     1:00:03
3     1:00:06
4     1:00:10

Here is the first query to write:

SELECT a.id as aid, b.id as bid
FROM Table a
JOIN Table b
ON (b.Timestamp BETWEEN a.Timestamp - INTERVAL 3 SECOND
    AND a.Timestamp + INTERVAL 3 SECOND)

It returns:

aid     bid
1       1
1       2
2       1
2       2
2       3
3       2
3       3
4       4

Let's create a nice table to hold those things that won't allow duplicates:

CREATE TABLE
Adjacency
( aid INT(11)
, bid INT(11)
, PRIMARY KEY (aid, bid) -- important for later
)

Now the challenge is to find something like the transitive closure of that relation.

To do so, let's find the next level of links. By that I mean: since we have (1, 2) and (2, 3) in the Adjacency table, we should add (1, 3):

INSERT IGNORE INTO Adjacency(aid,bid)
SELECT adj1.aid, adj2.bid
FROM Adjacency adj1
JOIN Adjacency adj2
ON (adj1.bid = adj2.aid)

This is the non-elegant part: You'll need to run the above INSERT statement over and over until it doesn't add any rows to the table. I don't know if there is a neat way to do that.
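
One option (a sketch, with a hypothetical procedure name): wrap the INSERT in a stored procedure that loops until ROW_COUNT() reports that a pass inserted nothing new.

DELIMITER //
CREATE PROCEDURE close_adjacency()
BEGIN
  REPEAT
    -- Same INSERT as above; INSERT IGNORE skips pairs already present
    INSERT IGNORE INTO Adjacency (aid, bid)
    SELECT adj1.aid, adj2.bid
    FROM Adjacency adj1
    JOIN Adjacency adj2 ON adj1.bid = adj2.aid;
  UNTIL ROW_COUNT() = 0 END REPEAT;  -- stop once a pass adds no new rows
END //
DELIMITER ;

CALL close_adjacency();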

Once this is over, you will have a transitively-closed relation like this:

aid     bid
1       1
1       2
1       3     --added
2       1
2       2
2       3
3       1     --added
3       2
3       3
4       4

And now for the punchline:

SELECT aid, GROUP_CONCAT( bid ) AS Neighbors
FROM Adjacency
GROUP BY aid

returns:

aid     Neighbors
1       1,2,3
2       1,2,3
3       1,2,3
4       4

So

SELECT DISTINCT Neighbors
FROM (
     SELECT aid, GROUP_CONCAT( bid ) AS Neighbors
     FROM Adjacency
     GROUP BY aid
     ) Groupings

returns

Neighbors
1,2,3
4

Whew!

Chris Cunningham
  • I have a concern that my Neighbors column might sometimes have `1,2,3` and sometimes `2,1,3`, so maybe there should be an `ORDER BY` in the innermost `GROUP_CONCAT` query to make sure things come out in order. – Chris Cunningham Jul 01 '11 at 18:56
  • One more thing: You don't have to run the INSERT statement infinitely many times... only as many times as the largest-sized group you want to make sure you capture. If you are pretty sure that there are no groups of size 100, you should be able to run it 100 times and it will do nothing most of the time. The problem with very large groups is that the `Adjacency` table will become large since a group of `n` things has on the order of `n^2` adjacency relations. – Chris Cunningham Jul 01 '11 at 19:03
  • Thanks! I have worked through this solution on my tables, and it does seem to be working nicely. – Brad Jul 01 '11 at 20:42
  • Good! -- but if you ever get sick of running all the `INSERT`s in my answer, you should implement @Tom H.'s idea which I am a big fan of. I haven't looked at his code so I can't speak to the implementation, but his plan solves your problem more specifically, so it might be worth looking into later! – Chris Cunningham Jul 01 '11 at 20:56
  • I am working on implementing his right now. I am running into some trouble, but I think I will figure it out. I'll post back with more once I figure out what I am doing wrong. – Brad Jul 01 '11 at 21:39

I like @Chris Cunningham's answer, but here's another take on it.

First, my understanding of your problem statement (correct me if I'm wrong):

You want to look at your event log as a sequence, ordered by the time of the event, and partition it into groups, defining a boundary as an interval of more than 3 seconds between two adjacent rows in the sequence.

I work mostly in SQL Server, so I'm using SQL Server syntax. It shouldn't be too difficult to translate into MySQL (a sketch of one translation follows the first query below).

So, first our event log table:

--
-- our event log table
--
create table dbo.eventLog
(
  id       int          not null ,
  dtLogged datetime     not null ,
  title    varchar(200) not null ,

  primary key nonclustered ( id ) ,
  unique clustered ( dtLogged , id )

)

Given the above understanding of the problem statement, the following query should give you the upper and lower bounds of your groups. It's a simple nested SELECT with two GROUP BYs to collapse things:

  • The innermost select defines the upper bound of each group. That upper boundary defines a group.
  • The outer select defines the lower bound of each group.

Every row in the table should fall into one of the groups so defined, and any given group may well consist of a single date/time value.

[edited: the upper bound is the lowest date/time value where the interval is more than 3 seconds]

select dtFrom = min( t.dtFrom ) ,
       dtThru =      t.dtThru
from ( select dtFrom = t1.dtLogged ,
              dtThru = min( t2.dtLogged )
       from      dbo.EventLog t1
       left join dbo.EventLog t2 on t2.dtLogged >= t1.dtLogged
                                and datediff(second,t1.dtLogged,t2.dtLogged) > 3
       group by t1.dtLogged
     ) t
group by t.dtThru
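
A rough MySQL translation of that query (a sketch: the column aliases move to AS form, datediff() becomes TIMESTAMPDIFF(), and the dbo schema goes away):

-- Inner query: for each time, the earliest later time more than 3 seconds
-- after it (the group's upper bound). Outer query: the earliest time
-- mapping to each upper bound (the group's lower bound).
SELECT MIN(t.dtFrom) AS dtFrom,
       t.dtThru
FROM ( SELECT t1.dtLogged AS dtFrom,
              MIN(t2.dtLogged) AS dtThru
       FROM      EventLog t1
       LEFT JOIN EventLog t2 ON  t2.dtLogged >= t1.dtLogged
                             AND TIMESTAMPDIFF(SECOND, t1.dtLogged, t2.dtLogged) > 3
       GROUP BY t1.dtLogged
     ) t
GROUP BY t.dtThru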

You could then pull rows from the event log and tag them with the group to which they belong thus:

select *
from ( select dtFrom = min( t.dtFrom ) ,
              dtThru =      t.dtThru
       from ( select dtFrom = t1.dtLogged ,
                     dtThru = min( t2.dtLogged )
              from      dbo.EventLog t1
              left join dbo.EventLog t2 on t2.dtLogged >= t1.dtLogged
                                       and datediff(second,t1.dtLogged,t2.dtLogged) > 3
              group by t1.dtLogged
            ) t
       group by t.dtThru
     ) period
join dbo.EventLog t on t.dtLogged >=           period.dtFrom
                   and t.dtLogged <= coalesce( period.dtThru , t.dtLogged )
order by period.dtFrom , period.dtThru , t.dtLogged

Each row is tagged with its group via the dtFrom and dtThru columns returned. You could get fancy and assign an integral row number to each group if you want.

Nicholas Carey
  • Are you sure this works if a group has more than 3 things in it? It looks to me that your innermost query will find the largest time that is within 3 seconds of the first time, but not (the largest time that is within 3 seconds of (the largest time that is within 3 seconds of the first time)) :/ – Chris Cunningham Jul 01 '11 at 19:33
  • I fixed the query. My definition of the upper bound was wrong: the correct flavor should be 'the lowest date/time value greater than or equal to the current row's such that the delta between the two is more than 3 seconds.' It's been a while since I've had to do something like this. I'm getting rusty. – Nicholas Carey Jul 01 '11 at 19:47