
I have a large database I use for plotting and data examination. For simplicity, say it looks something like this:

|    id    |    day    |    obs    |
+----------+-----------+-----------+
|    1     |    500    |    4.5    | 
|    2     |    500    |    4.4    | 
|    3     |    500    |    4.7    | 
|    4     |    500    |    4.8    | 
|    5     |    600    |    5.1    | 
|    6     |    600    |    5.2    | 
                ...

This could be stock market data, where we have many points per day that are measured.

What I want to do is look at much longer trends, where the multiple points per day are unnecessary resolution and clog my plotting application. (I want to look at 30,000 days, each with about 100 observations.)

Is there a way to do something like `SELECT ... LIMIT 1 PER "day"`?

I suppose I could perform a few SELECT DISTINCT queries to find the correct IDs, but I'd rather use something simple if it's built in.

It doesn't matter if it's the first, last, or an average value per day. Just a single value; I prefer whatever is fastest.

Also, I'd like to do this for Postgres, MySQL, and SQLite. My application is built to use all three, and I frequently switch between them.

Thanks!

Background: This is for a Ruby on Rails plotting application, so a trick with ActiveRecord will work too. https://github.com/ZachDischner/Rails-Plotter

zachd1_618
  • Are you trying to only select one value for each day or an average value for each day? – Kevin Holditch Feb 15 '13 at 16:37
  • See here: [How to select the first/least/max row per group in SQL](http://www.xaprb.com/blog/2006/12/07/how-to-select-the-firstleastmax-row-per-group-in-sql/). – Joseph Silber Feb 15 '13 at 16:38
  • I guess I should have specified, but it doesn't matter much to me. Long trends can take a random daily value or an average. An average might just take more time to perform the query – zachd1_618 Feb 15 '13 at 16:44

2 Answers


You should tag your question with the brand of RDBMS you're using. Rails developers frequently use MySQL, but the answer to your question depends on which one you mean.

For all brands except MySQL, the correct and standard solution is to use window functions (note that SQLite only added them in version 3.25, and MySQL in 8.0):

SELECT * FROM (
  SELECT s.*, ROW_NUMBER() OVER (PARTITION BY day ORDER BY id) AS rn
  FROM stockmarketdata AS s
) AS t
WHERE t.rn = 1;
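To make the intent concrete, here is the same "first row per day" logic expressed in plain Ruby (a hypothetical sketch with made-up rows standing in for the `stockmarketdata` table): partition the rows by day, then keep only the first row of each partition.

```ruby
# Hypothetical in-memory rows standing in for the stockmarketdata table.
rows = [
  { id: 1, day: 500, obs: 4.5 },
  { id: 2, day: 500, obs: 4.4 },
  { id: 5, day: 600, obs: 5.1 },
  { id: 6, day: 600, obs: 5.2 },
]

# Equivalent of PARTITION BY day ... WHERE rn = 1:
# group rows by day, then take the first row in each group.
one_per_day = rows.group_by { |r| r[:day] }.map { |_day, group| group.first }

one_per_day.each { |r| puts r.inspect }
```

This is exactly the downsampling the query performs server-side: one representative row per day, so the plotting layer never sees the other 99 observations.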

For older versions of MySQL (before 8.0), which don't support window functions, you can simulate them in a somewhat clumsy way with session variables:

SELECT * FROM (SELECT @day := 0, @r := 0) AS _init,
(
  SELECT IF(day = @day, @r := @r + 1, @r := 1) AS rn, @day := day AS d, s.*
  FROM stockmarketdata AS s
  ORDER BY day, id
) AS t
WHERE t.rn = 1;
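The session-variable trick is just a stateful scan: walk the rows in day order, keep a counter that resets to 1 on each new day, and emit a row only when the counter is 1 (resetting to 0 instead would make `WHERE rn = 1` pick the *second* row of each day and skip days with a single observation). A plain-Ruby sketch of that scan, with hypothetical data:

```ruby
rows = [
  { id: 1, day: 500, obs: 4.5 },
  { id: 2, day: 500, obs: 4.4 },
  { id: 5, day: 600, obs: 5.1 },
]

prev_day = nil  # plays the role of @day
rn = 0          # plays the role of @r
first_per_day = rows.sort_by { |r| [r[:day], r[:id]] }.select do |r|
  rn = (r[:day] == prev_day ? rn + 1 : 1)  # counter resets on a new day
  prev_day = r[:day]
  rn == 1                                  # keep only the first row per day
end
```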
Bill Karwin
  • This is great! I'll keep it open for another day or two to see if any more answers pop up. But this is exactly what I'm looking for! – zachd1_618 Feb 15 '13 at 17:06

You left a lot of room for options with your statement:

It doesn't matter if its the first, last, or an average value per day. Just a single value. I just prefer what is fastest.

So, I'm going to leave the id out of it and first propose taking the average of obs for each group, as the simplest and probably most practical option, though running aggregate functions may not be as fast as a simple limit:

MyModel.group(:day).average(:obs)

If you wanted the minimum:

MyModel.group(:day).minimum(:obs)

If you wanted the maximum:

MyModel.group(:day).maximum(:obs)
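Each of those ActiveRecord calls returns a hash keyed by day. For illustration, the same per-day aggregation in plain Ruby, with hypothetical data (ActiveRecord does this in SQL with `GROUP BY`, which is far more efficient for 3 million rows):

```ruby
rows = [
  { day: 500, obs: 4.5 },
  { day: 500, obs: 4.7 },
  { day: 600, obs: 5.1 },
]

by_day = rows.group_by { |r| r[:day] }

# Mirrors MyModel.group(:day).average(:obs), .minimum(:obs), .maximum(:obs)
averages = by_day.transform_values { |g| g.sum { |r| r[:obs] } / g.size }
minimums = by_day.transform_values { |g| g.map { |r| r[:obs] }.min }
maximums = by_day.transform_values { |g| g.map { |r| r[:obs] }.max }
```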

(Note: The following 2 examples are less efficient than just entering the SQL, but might be more portable.)

But you might want all three:

ActiveRecord::Base.connection.execute(MyModel.select('MIN(obs), AVG(obs), MAX(obs)').group(:day).to_sql).to_a

Or just the data without hashes:

ActiveRecord::Base.connection.exec_query(MyModel.select('MIN(obs), AVG(obs), MAX(obs)').group(:day).to_sql)

If you want median, see this question which is more DB specific, and there are other related posts about it if you search.

And for more, some DBs like Postgres have variance(...), stddev(...), etc. built in.

Finally, check out the query section in the Rails guide and ARel for more info on constructing queries. For example, you can apply a limit to an ActiveRecord relation via first or limit, and in ARel, take lets you do a limit. Subqueries are possible too, as shown in answers to this question, and so is group by, etc. If you are sharing this project with others, try to limit the amount of non-portable SQL you use, unless you plan on adding and maintaining support for other databases yourself.

Gary S. Weaver
  • Thanks for the answer. This idea works too, but I'm going to verify the other solution for the sake of speed. I appreciate it! – zachd1_618 Feb 19 '13 at 16:56
  • Remember that data quality can often be more important than speed. For example, the first or last observed data point could potentially hide much higher or lower values. Outdoor temperature is a great example of this; if you measure only in the afternoon, the temperature is often going to be higher than in the early morning. If you must take only one measurement, perhaps order by RAND()/RANDOM(), etc. – Gary S. Weaver Feb 19 '13 at 18:38
  • Absolutely. I agree and appreciate the input. I'm going to try to implement both methods and see what works. Working off your example, you are right that temps change throughout the day. But if I'm looking at the long-term (50-year) trends for changes (relative, not absolute), then it won't matter if I'm recording in the morning or at 3pm. That's the sort of data I'm looking at. Thanks for the input! – zachd1_618 Feb 20 '13 at 20:04