0

How can you select a fixed number of rows from a table, in which a timestamp column, and the return rows are equally distant from each other. I need these points as sample points to use to plot a timeseries. I am aware of other techniques to solve this problem such as crossfilter, etc, but I want to be able using the server, for now.

For example, the table below: (timestamp is simplified just for clarity)

id    key    val   timestamp
1     'a'    100   1am
2     'b'    120   2am
3     'c'    130   3am
4     'd'    140   4am
5     'e'    130   5am
6     'f'    135   6am
7     'g'    136   7am
8     'h'    139   8am
9     'i'    149   9am
10    'j'    140   10am
11    'k'    140   11am
12    'l'    135   12pm

so I want to be able to run a query that will return a sample of size 3, for example, and it should return rows 1, 5 and 9.

I dont want to use the id, because my table is more complicated than this, and I will be applying where clauses, etc to my query, and so using ID is not going to work.

From working with other RDBSs I am aware of RANK, but it doesnot seem that it exists in mysql, I saw workarounds, like the one here, but I don't think it is a clean way to write mysql.

Any suggestions on how to approach this problem?

Community
  • 1
  • 1
Isaac
  • 2,701
  • 4
  • 30
  • 47
  • 1
    Consider [this answered SO question about selecting every nth row from a MySQL table](http://stackoverflow.com/questions/858746/how-do-you-select-every-n-th-row-from-mysql). – Aaron Miller Jul 31 '13 at 21:11
  • Why would a sample size of 3 return rows 1,5, and 9? Why not 1,2,3 as they are each 1 hour apart? Can you better explain what your logic is? – Mike Brant Jul 31 '13 at 21:12
  • @Mike Brant 1, 5, 9 is a better sample because it stretches better between the min and max values – Isaac Jul 31 '13 at 21:15
  • @Is7aq OK but you still have not explained your logic. So it sounds like there is a component that needs to consider the total number of rows in the table... what else? – Mike Brant Jul 31 '13 at 21:17
  • @Is7aq By the way, I don't think you are going to find a simple one-query solution to the problem. I just wanted to help you along in terms of thinking through how you would solve the problem mathematically. Until you understand that, you are not going to be able to think about how you might implement queries to get what you want. – Mike Brant Jul 31 '13 at 21:19
  • @Is7aq Also it might help if you consider real-word timestamps not simplified timestamps that are ideally spaced apart from row to row. What do you to happen in that case? – Mike Brant Jul 31 '13 at 21:20
  • @AaronMiller, that's what I want, but that query doesnot give me any performance gain. I guess that's the best you can do here. – Isaac Jul 31 '13 at 21:29
  • @MikeBrant, so if the table has 9000 rows of interest, I only want an X number of points, say 200, that is equally distant from each other and start from the minimum timestamp to the maximum timestamp of those 9000 rows. Does it make sense? – Isaac Jul 31 '13 at 21:30
  • @Is7aq Are each of those 9000 rows guaranteed to be spaced equally apart in time? – Mike Brant Jul 31 '13 at 21:33
  • @MarkBrant: not in value necessarily, but in rank. – Isaac Jul 31 '13 at 21:35
  • @Is7aq I would think think that you could easily achieve this in two queries. A first one to get row count and min/max values (from this you can derive the interval between each row). Based on that, you use a WHERE condition with modulus operator (as shown in linked question above) to select rows as needed. – Mike Brant Jul 31 '13 at 21:42
  • @Is7aq You have not clarified whether the distance will be calculated based on timestamp or rank/id. E.g. you may have 5 entries with following timestamps `1:00`, `2:00`, `2:30`, `3:00`, `3:05` . Say you want 3 rows. If you go by rank it would be `1:00`,`2:30`,`3:05`. iF you go by timestamp it would be `1:00`, `2:00`, `3:00`. So which is it? Also, in case you don't have an exact match what offset (in plus minus seconds) will you accept, and if for a specific timestamp interval there is no match should there be null. Please clarify whether time intervals will be used or rank/id intervals. – Menelaos Aug 02 '14 at 11:11

2 Answers2

1

You probably need a step function to map your time stamps to a finite set of "steps". This could be expressed like this in MySQL:

--
-- `min_v` and `max_v` are respectively the first and last value value on the range
-- `samples` is the number of sample ("steps") expected
-- `value` is the actual value
-- 
CREATE FUNCTION step(min_v int, max_v int, samples int, value int)
RETURNS int DETERMINISTIC
RETURN min_v + (value - min_v) * (samples - 1) DIV (max_v-min_v)

For sake of simplicity, I used here integers instead of timestamps. You will easily find in the MySQL documentation how to convert from timestamps to "unix epoch".

Once the function is defined, you just have to group by "step" in your select query, keeping only the first sample of each step:

select data.k, data.value FROM tbl AS data
join (select id, MIN(ts) FROM tbl GROUP BY step(1,12,4,ts) ) as s
on s.id = data.id;

See http://sqlfiddle.com/#!2/d5ccb/3 for a live example.

Sylvain Leroux
  • 50,096
  • 7
  • 103
  • 125
1

Do this with basic arithmetic. If you assume (as in your example) that the times are precise, and you know the first time stamp you want and the hours between them:

select t.*
from t
where mod(TIME_TO_SEC(TIMEDIFF(t.timestamp, @FirstTimeStamp)), 60*60*@HourDIff) = 0;

To calculate the hours between, just take the integer portion of the difference between the min and max timestamp:

select (TIME_TO_SEC(TIMEDIFF(max(t.timestamp), min(t.timestamp)) / @YOURCOUNT) as DiffSeconds;

Further, lets assume that the first time stamp is the first one in the table.

Now, let's put this together:

select t.*
from t cross join
     (select min(timestamp) as FirstTimeStamp,
             select (TIME_TO_SEC(TIMEDIFF(max(t.timestamp), min(t.timestamp)) / @YOURCOUNT) as DiffSeconds
      from t
     ) const
where mod(TIME_TO_SEC(TIMEDIFF(t.timestamp, FirstTimeStamp)), DiffSeconds) = 0;

This does assume that your timestamps are perfectly accurate. If that is a problem, perhaps you should use id instead.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786