MySql query histogram for time intervals data

Question

I have an event input of this type

event user
event start
event end
event type

Inserted to MySql table, each in its own row with user+start as primary key.

I need to query an histogram for a type by time interval (say minute) counting events occurred on each time interval. something like:

SELECT count(*) as hits FROM events 
WHERE type="browsing" 
GROUP BY time_diff("2015-1-1" AND "2015-1-2") / 60 * second

but I could not find any way to do that in SQL besides writing code, any idea?

Sample data

user, start, end, type
1, 2015-1-1 12:00:00, 2015-1-1 12:03:59, browsing
2, 2015-1-1 12:03:00, 2015-1-1 12:06:00, browsing
2, 2015-1-1 12:03:00, 2015-1-1 12:06:00, eating
3, 2015-1-1 12:03:00, 2015-1-1 12:08:00, browsing

the result should look like this:

         ^
count    |
browsing |
users    |       *
         |       *  *  *  *
         | *  *  *  *  *  *  *  *
         --|--|--|--|--|--|--|--|--|--> minute
         0  1  2  3  4  5  6  7  8  9

Provide sample data and desired results. It is not exactly clear what you are looking for. — Gordon Linoff, Aug 16 '15 at 13:22
. . Add the sample data and desired results *in the question*. — Gordon Linoff, Aug 16 '15 at 13:30
What do you mean by "besides writing code"? Do you consider SQL statements to be code or not? — das-g, Aug 16 '15 at 13:45
@moshebeeri . . . There are two challenging parts to this. One is to get all the time units you need for the histogram (a list of minutes) and the other is to do the calculation. Do you have the first ? — Gordon Linoff, Aug 16 '15 at 13:48
Regarding `Select count(*) and hits`: I think you have to use a comma ('`,`') instead of '`and`' here, if you want both columns listed. Otherwise, the query might return only one column that is the result of applying an `AND` boolean operator (`TRUE` for rows where the values in both columns evaluate to `TRUE`, `FALSE` elsewhere). — das-g, Aug 16 '15 at 13:54
@das-g should be as not and :-) and I have no good idea on how to do that sql statement is not code unless you have to write stored procedure. — moshe beeri, Aug 16 '15 at 14:18
@c_x, why did you change `AS` back to `AND`? If I understood moshe correctly, that code is illustrative, not the subject of the question, thus fixing it to say what moshe meant should be OK. — das-g, Aug 16 '15 at 16:55

techdude · Accepted Answer · 2015-08-24T07:30:23.787

You can do this using group by with the level that you want. Here is an example using the data you gave:

First the SQL to create the table and populate it. The ID column here isn't "needed" but it is recommended if the table will be large or have indexes on it.

CREATE TABLE `test`.`events` (
  `id` INT NOT NULL AUTO_INCREMENT,
  `user` INT NULL,
  `start` DATETIME NULL,
  `end` DATETIME NULL,
  `type` VARCHAR(45) NULL,
  PRIMARY KEY (`id`));

INSERT INTO events (user, start, end, type) VALUES 
(1, '2015-1-1 12:00:00', '2015-1-1 12:03:59', 'browsing'),
(2, '2015-1-1 12:03:00', '2015-1-1 12:06:00', 'browsing'),
(2, '2015-1-1 12:03:00', '2015-1-1 12:06:00', 'eating'),
(3, '2015-1-1 12:03:00', '2015-1-1 12:08:00', 'browsing');

To get a list of ordered pairs of number of minutes duration to number of events:

The query can then be easily written using the timestampdiff fuction, as shown below:

SELECT 
    TIMESTAMPDIFF(MINUTE, start, end) as minutes,
    COUNT(*) AS numEvents
FROM
    test.events
GROUP BY TIMESTAMPDIFF(MINUTE, start, end)

The output:

minutes      numEvents
3            3
5            1

The first parameter in the select can be one of FRAC_SECOND, SECOND, MINUTE, HOUR, DAY, WEEK, MONTH, QUARTER, or YEAR.

Here are some more examples of queries you can do:

Events by hour (floor function is applied)

SELECT 
    TIMESTAMPDIFF(HOUR, start, end) as hours,
    COUNT(*) AS numEvents
FROM
    test.events
GROUP BY TIMESTAMPDIFF(HOUR, start, end)

**Events by hour with better formatting **

SELECT 
    CONCAT("<", TIMESTAMPDIFF(HOUR, start, end) + 1) as hours,
    COUNT(*) AS numEvents
FROM
    test.events
GROUP BY TIMESTAMPDIFF(HOUR, start, end)

You can group by a variety of options, but this should definitely get you started. Most plotting packages will allow you to specify arbitrary x y coordinates, so you don't need to worry about the missing values on the x axis.

To get a list of ordered pairs of number of events at a specific time (for logging): Note that this is left for reference.

Now for the queries. First you have to pick which item you want to use for the grouping. For example, a task might take more than a minute, so the start and end would be in different minutes. For all these examples, I am basing them off of the start time, since that is when the event actually took place.

To group event counts by minute, you can use a query like this:

SELECT 
     DATE_FORMAT(start, '%M %e, %Y %h:%i %p') as minute, 
     count(*) AS numEvents 
FROM test.events 
GROUP BY YEAR(start), MONTH(start), DAYOFMONTH(start), HOUR(start), MINUTE(start);

Note how this groups by all the items, starting with year, going the minute. I also have the minute displayed as a label. The resulting output looks like this:

minute                      numEvents
January 1, 2015 12:00 PM    1
January 1, 2015 12:03 PM    3

This is data that you could then take using php and prepare it for display by one of the many graphing libraries out there, plotting the minute column on the x axis, and plotting the numEvents on the y axis.

Here are some more examples of queries you can do:

Events by hour

SELECT 
     DATE_FORMAT(start, '%M %e, %Y %h %p') as hour, 
     count(*) AS numEvents 
FROM test.events 
GROUP BY YEAR(start), MONTH(start), DAYOFMONTH(start), HOUR(start);

Events by date

SELECT 
    DATE_FORMAT(start, '%M %e, %Y') as date, 
    count(*) AS numEvents 
FROM test.events 
GROUP BY YEAR(start), MONTH(start), DAYOFMONTH(start);

Events by month

SELECT 
    DATE_FORMAT(start, '%M %Y') as date, 
    count(*) AS numEvents 
FROM test.events 
GROUP BY YEAR(start), MONTH(start);

Events by year

SELECT 
    DATE_FORMAT(start, '%Y') as date, 
    count(*) AS numEvents 
FROM test.events 
GROUP BY YEAR(start);

I should also point out that if you have an index on the start column for this table, these queries will complete quickly, even with hundreds of millions of rows.

Hope this helps! Let me know if you have any other questions about this.

As for the graphing part, I'll leave an addendum here. There are many good libraries out there, and you can look into those (generally, we try to keep questions here on stack overflow to one question in a post, and we don't recommend tutorials, or services, because they are primarily opinion based.) However, the query I have here produces data that should be easy to graph in any of the major libraries. — techdude, Aug 16 '15 at 14:32
Check out D3 - http://d3js.org/ or ChartJS - http://www.chartjs.org/ for a couple of examples. — techdude, Aug 16 '15 at 14:34
Its cool idea, but it seems to not comply with the example in the question. on minute 4 for example there are two users browsing the web while your query seems to miss that. your query applies well for events by timestamp, my question is dealing with intervals — moshe beeri, Aug 16 '15 at 16:07
Ah, I misunderstood your question. I might update my answer, but I might not get to it for a few days. — techdude, Aug 16 '15 at 18:43
Hi, if you can/have an update I'll be more then happy if you'll post it. I believe this question arises a point that is not well covered by SQL in general, hence there is luck of knowledge in this field. — moshe beeri, Aug 23 '15 at 07:05
Ok, I updated the answer (I left the existing stuff, but I added the new stuff just below the table creation. The key is using TIMESTAMPDIFF with the first parameter set to the interval you want. Time stamp diff will let you convert the diff to the unit that you want (seconds, minutes, hours, etc). The numbers are floored, so 40 minutes is considered 0 hours. I added an example that adds one and puts the < sign. — techdude, Aug 24 '15 at 07:36

score 0 · Answer 2 · answered Aug 16 '15 at 15:19

0

I am going to assume that you have a numbers table that contains integers. You also have $starttime and $endtime.

This is one way to get the values you want:

select ($starttime + interval n.n - 1 minute) as thetime, n.n as minutes,
       count(sd.user)
from numbers n left join
     sampledata sd
     on $starttime + interval n.n - 1 minute between sd.start and sd.end
where $starttime + interval n.n - 1 minute <= $endtime and
      sd.end >= $starttime and
      sd.start <= $endtime
group by n.n
order by n.n;

answered Aug 16 '15 at 15:19

Gordon Linoff

1,242,037
58
646
786

10x Gordon, it seems that this solution will work, the basic idea is to join with time interval table where the interval should be between event time. thought this idea solves the problem I got a feeling it is not most efficient nor the Occam razor solution, especially when given more then one time series needed to be check for consequential event occurrence between them (this is the main issue I am dealing with e.g. comparing or counting users who browsed the web while listening to music.) I hope there is a better way given in SQL to do that. – moshe beeri Aug 17 '15 at 07:48

MySql query histogram for time intervals data

2 Answers2

Linked