11

Trying to retrieve an array of ActiveRecord Objects grouped by date with PostgreSQL.

More specifically I'm trying to translate the following MySQL query:

@posts = Post.all(:group => "date(date)", 
   :conditions => ["location_id = ? and published = ?", @location.id, true], 
   :order => "created_at DESC")

I am aware that PostgreSQL interpretation of the SQL standard is stricter than MySQL and that consequently this type of query won't work...and have read a number of posts on StackOverflow and elsewhere on the subject - but none of them seem to be the definitive answer on this subject

I've tried various combinations of queries with group by and distinct clauses without much joy - and for the moment I have a rather inelegant hack which although works makes me blush when I look at it.

What is the proper way to make such a query with Rails and PostgreSQL ? (Ignoring the fact that surely this should be abstracted away at the ActiveRecord Level)

marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
digitalfrost
  • 229
  • 2
  • 9
  • An "array...grouped by date" - this makes no sense. What are you trying to achieve? Can you just order by date(date)? – DanSingerman Aug 18 '10 at 13:01
  • 1
    Any database, except MySQL, will reject illegal SQL. Databases don't guess what results you would like today, db's should only get all correct results in all situations. Use ONLY_FULL_GROUP_BY in MySQL, and the above query will be rejected by MySQL as well. – Frank Heikens Aug 18 '10 at 13:37
  • Hi Dan - I am trying to obtain an array of Post objects but I only want to retrieve one Post for any given day (the latest Post for that day). – digitalfrost Aug 21 '10 at 15:55
  • It's not so much that it's stricter as it is that MySQL is willing to make assumption on some common behaviors (like auto increment, etc). If you think about what the database has to do in order to retrieve the first row in an ordered group by, it's basically: 1. Retrieve the entire set, ordered by the group column 2. Sort each subset of the group column by the order clause 3. Take the first row from each subset as a whole new set and then sort THAT by the order PG just doesn't make the assumptions so you can do the same thing explicitly with subqueries, which are very efficient in PG. – brightball Jun 06 '15 at 15:41

3 Answers3

14

The PostgreSQL feature you want to use here is DISTINCT ON. There are two basic ways to go about making this query via ActiveRecord.

The first method is to just specify the :select and :order options. This works great when you have a fairly simple query with no :joins or :include.

Post.all(
  :select => 'DISTINCT ON (date::date) *',
  :order => 'date::date DESC, created_at DESC'
)

If you have a more complex query where ActiveRecord generates its own SELECT clause, you can use a subquery to select the target records.

Post.all(
  :joins => 'INNER JOIN (SELECT DISTINCT ON (date::date) id FROM posts ORDER BY date::date DESC, created_at DESC) x ON x.id = posts.id'
)

Note that this could be a fair bit slower than the first method depending on your data. I would only use this method if required. Be sure to benchmark with production-like data.

Jason Weathered
  • 7,762
  • 2
  • 41
  • 37
1

My solution:

def self.columns_list
   column_names.collect { |c| "#{table_name}.#{c}" }.join(",")
 end

 scope :selling, joins(:products).group(columns_list)

Simple and repeatable.

Jack Kinsella
  • 4,491
  • 3
  • 38
  • 56
0

While SQL is quite simple when it comes to answering questions like "when was the most recent post for each day?" it is NOT very straight forward when you ask "which was the most recent post for each day?"

You can't retrieve the latest Post for each day without using a sub SELECT (or multiple SQL statements). This might work for you (use Post.find_by_sql or similar):

SELECT P.*, M.just_day, M.max_created_at
FROM posts P
JOIN (
  SELECT date(P2.date) AS just_day, MAX(P2.created_at) AS max_created_at
  FROM posts P2
  P.location_id='12345' AND P.published=true
  GROUP BY date(P2.date)
) AS M  
   ON AND M.max_created_at = P.created_at
WHERE P.location_id='12345' AND P.published=true

The above SQL statement should be enough if you can be certain that two posts will not have the same value in the created_at column. If you can't guarantee uniqueness in the created at column, then you either need to filter out the duplicates in Ruby (that shouldn't be too inefficient because presumably you will be looping over the list anyway) or you will need to do N+1 SQL statements. (Actually you could do per-row selects, but AFAIK that is just as inefficient as N+1 SQL statements.)

Here is how you could remove duplicates while looping:

last_post = nil
posts.each do |post|
  unless post.just_day == last_past.try(:just_day)
    # Do stuff
    last_post = post
  end
end

That said, you could write it nicely with just Ruby/ActiveRecord, if you have few enough days that a SELECT for each day isn't too bad:

days = Post.group("date(date)")
posts = days.each { |day| Post.order('created DESC').where("date(day) = ?", day) }

If you are using pagination (say 10 items per page), then this will require 11 SQL statements for each page. Not ideas, but the simplicity might be worth the inefficiency.

Honestly, if you expect this query to be both run frequently and with a reasonably large data set, then I suggest you add a boolean column called most_recent. The last post from past days won't change. You only need to worry about the posts from today. Just set up a cron job to run a few minutes after the end of the day to update the value for the last day. If you want something more up-to-date, you could have the cron job run every 5 minutes. Or if you need real-time, then add an after_save callback to set most_recent to false for all today's posts that aren't the current post.

This question is similar: MySQL: Getting highest score for a user

Community
  • 1
  • 1
phylae
  • 973
  • 1
  • 10
  • 18