10

I have a Rails app, which uses postgresql for a database, that sorts different types of users by location, and then by reputation points they receive for various activities on the site. This is an example query

 @lawyersbylocation = User.lawyers_by_province(province).sort_by{ |u| -u.total_votes }

The query calls the scope lawyers_by_province on the User.rb model:

 scope :lawyers_by_province, lambda {|province|
  joins(:contact).
  where( contacts: {province_id: province},
         users: {lawyer: true})

  }

And then, still on the User.rb model, it calculates reputation points they have.

 def total_votes
    answerkarma = AnswerVote.joins(:answer).where(answers: {user_id: self.id}).sum('value') 
    contributionkarma = Contribution.where(user_id: self.id).sum('value')
    bestanswer = BestAnswer.joins(:answer).where(answers: {user_id: self.id}).sum('value') 
    answerkarma + contributionkarma + bestanswer
 end

I've been told that if the site reaches a certain number of users, then it will become incredibly slow because it's sorting in Ruby rather than at the database level. I know that comment refers to the total_votes method, but I'm not sure if the lawyers_by_province is happening at the database level or in ruby, in that it's using Rails helpers to query the db. Seems like a mix of both to me, but I'm not sure about the effect of that on efficiency.

Can you show me how to write this so that the query is happening at the db level and therefore in a more efficient way that won't break my site?

Update Here are the three schemes for models in total_votes method.

 create_table "answer_votes", force: true do |t|
    t.integer  "answer_id"
    t.integer  "user_id"
    t.integer  "value"
    t.boolean  "lawyervote"
    t.boolean  "studentvote"
    t.datetime "created_at"
    t.datetime "updated_at"
  end

  add_index "answer_votes", ["answer_id"], name: "index_answer_votes_on_answer_id", using: :btree
  add_index "answer_votes", ["lawyervote"], name: "index_answer_votes_on_lawyervote", using: :btree
  add_index "answer_votes", ["studentvote"], name: "index_answer_votes_on_studentvote", using: :btree
  add_index "answer_votes", ["user_id"], name: "index_answer_votes_on_user_id", using: :btree



create_table "best_answers", force: true do |t|
    t.integer  "answer_id"
    t.integer  "user_id"
    t.integer  "value"
    t.datetime "created_at"
    t.datetime "updated_at"
    t.integer  "question_id"
  end

  add_index "best_answers", ["answer_id"], name: "index_best_answers_on_answer_id", using: :btree
  add_index "best_answers", ["user_id"], name: "index_best_answers_on_user_id", using: :btree



create_table "contributions", force: true do |t|
    t.integer  "user_id"
    t.integer  "answer_id"
    t.integer  "value"
    t.datetime "created_at"
    t.datetime "updated_at"
  end

  add_index "contributions", ["answer_id"], name: "index_contributions_on_answer_id", using: :btree
  add_index "contributions", ["user_id"], name: "index_contributions_on_user_id", using: :btree

Also, here is the contact scheme which contains the province_id used in the lawyers_by_province scope on user.rb model

  create_table "contacts", force: true do |t|
    t.string   "firm"
    t.string   "address"
    t.integer  "province_id"
    t.string   "city"
    t.string   "postalcode"
    t.string   "mobile"
    t.string   "office"
    t.integer  "user_id"
    t.string   "website"
    t.datetime "created_at"
    t.datetime "updated_at"
  end

Update Trying to apply the answer by @Shawn, I put this method in the user.rb model

 def self.total_vote_sql
    "( " +
    [
     AnswerVote.joins(:answer).select("user_id, value"),
     Contribution.select("user_id, value"),
     BestAnswer.joins(:answer).select("user_id, value")
    ].map(&:to_sql) * " UNION ALL " + 
    ") as total_votes "
  end

and then in the controller, I did this (putting User in front of total_vote_sql)

@lawyersbyprovince = User.select("users.*, sum(total_votes.value) as total_votes").joins("left outer join #{User.total_vote_sql} on users.id = total_votes.user_id").
                            order("total_votes desc").lawyers_by_province(province)

It's giving me this error

ActiveRecord::StatementInvalid in LawyerProfilesController#index

PG::Error: ERROR: column reference "user_id" is ambiguous LINE 1: ..."user_id" = "users"."id" left outer join ( SELECT user_id, v... ^ : SELECT users.*, sum(total_votes.value) as total_votes FROM "users" INNER JOIN "contacts" ON "contacts"."user_id" = "users"."id" left outer join ( SELECT user_id, value FROM "answer_votes" INNER JOIN "answers" ON "answers"."id" = "answer_votes"."answer_id" UNION ALL SELECT user_id, value FROM "contributions" UNION ALL SELECT user_id, value FROM "best_answers" INNER JOIN "answers" ON "answers"."id" = "best_answers"."answer_id") as total_votes on users.id = total_votes.user_id WHERE "contacts"."province_id" = 6 AND "users"."lawyer" = 't' ORDER BY total_votes desc

Update After applying edits to Shawn's post, the error message is now this:

PG::Error: ERROR: column reference "user_id" is ambiguous LINE 1: ..."user_id" = "users"."id" left outer join ( SELECT user_id as... ^ : SELECT users.*, sum(total_votes.value) as total_votes FROM "users" INNER JOIN "contacts" ON "contacts"."user_id" = "users"."id" left outer join ( SELECT user_id as tv_user_id, value FROM "answer_votes" INNER JOIN "answers" ON "answers"."id" = "answer_votes"."answer_id" UNION ALL SELECT user_id as tv_user_id, value FROM "contributions" UNION ALL SELECT user_id as tv_user_id, value FROM "best_answers" INNER JOIN "answers" ON "answers"."id" = "best_answers"."answer_id") as total_votes on users.id = total_votes.tv_user_id WHERE "contacts"."province_id" = 6 AND "users"."lawyer" = 't' ORDER BY total_votes desc
David Aldridge
  • 51,479
  • 8
  • 68
  • 96
BrainLikeADullPencil
  • 11,313
  • 24
  • 78
  • 134
  • Actually, your scope `:lawyers_by_province` is translated to SQL before being run, so far it's ok. The total_votes method... well I might not have the best answer here nor the right one but I think the simplest way you could achieve sorting and everything is by writing pure SQL `ActiveRecord::Base.connection.execute("SQL here")`. – Raindal May 17 '13 at 15:50
  • @Sparda ok, thanks, or is it ok to add the total_votes code inside the lawyers_by_province scope? Either way, could you help with the sql and where would I put that ActiveRecord code? inside a method? – BrainLikeADullPencil May 17 '13 at 16:05
  • Well it depends on whether you often call `:lawyers_by_province` without sorting or not. It you do then the scope is justified and I guess it is not a problem to do the sorting stuff in another method. If both are always called together then mixing them together may be a good idea. Now for the SQL I would write a class method. I'll try to write an answer. – Raindal May 17 '13 at 16:24
  • @Sparda I'll call them together most of the time. – BrainLikeADullPencil May 17 '13 at 16:32
  • Ok so I would suggest converting your total_votes's 3 queries to SQL ones. – Raindal May 17 '13 at 16:35
  • @sparda I'll probably do both, as the app's not finished yet. If you have tips on how to write the method/sql for the total_votes, I'd be really grateful if you could share. Thanks – BrainLikeADullPencil May 17 '13 at 17:37
  • Actually it's not something I can come out with as is, I would need an environment configured as yours to test so I can't really provide a working answer on how to write the SQL. – Raindal May 17 '13 at 17:42
  • Please post the actual SQL and schemas. It looks like you could maintain a karma field for each lawyer, and this would yield the fastest plan with a proper index. – Denis de Bernardy May 17 '13 at 19:17
  • @Denis I posted the scheme for the three models in the total_votes method, as well as the schema for the contacts model that is referred to in the lawyers_by_province scope of the user.rb model. – BrainLikeADullPencil May 17 '13 at 19:34
  • Consider adding a karma field to your contacts table that is maintained using triggers on insert, update and delete on your three answer tables. This will yield the better plan in that you'll be able to order by karma directly, with an index on `(location_id, karma desc)`. – Denis de Bernardy May 17 '13 at 19:42
  • @Denis adding the karma field is easy enough. I assume by location_id you meant province_id? – BrainLikeADullPencil May 17 '13 at 20:21
  • Yeah... Then, precompute the karma (ideally using triggers, but Rails works too if you're more comfortable with that), and you won't need to worry about convoluted queries. – Denis de Bernardy May 18 '13 at 15:21
  • @Denis so when you say to add a karma field to contacts that is 'maintained' do you mean that anytime points are awarded to a user that I should then update the contacts.karma field? – BrainLikeADullPencil May 23 '13 at 14:38

5 Answers5

8

Caveat: I'm quite new to Rails, but this is my technique for keeping sane while needing to continually go straight to the database for performance reasons, which I need to do all the time because you can only have two of the following

  1. Processing of bulk data
  2. A pure-Rails technique
  3. Good performance

Anyway, once you need to get into these hybrid methodologies, which are part-ruby part-SQL I feel like you might as well go the whole hog and opt for a pure SQL solution.

  1. It's easier to debug because you're isolating the two code layers more effectively.
  2. It's easier to optimise the SQL because you stand a better chance of getting a dedicated SQL person to look at it for you if it's not your strong-point.

I think the SQL that you're looking for here is along the lines of:

with cte_scoring as (
  select
    users.id,
    (select Coalesce(sum(value),0) from answer_votes  where answer_votes.user_id  = users.id) +
    (select Coalesce(sum(value),0) from best_answers  where best_answers.user_id  = users.id) +
    (select Coalesce(sum(value),0) from contributions where contributions.user_id = users.id) total_score
  from
    users join
    contacts on (contacts.user_id = users.id)
  where
    users.lawyer         = 'true'          and
    contacts.province_id = #{province.id})
select   id,
         total_score
from     cte_scoring
order by total_score desc
limit    #{limit_number}

This ought to give you the best possible performance -- the subqueries in the SELECT are not ideal but the technique does apply filtering on which user_id you're checking the score for.

Integrating into Rails: If you define sql_string as the SQL code:

scoring = ActiveRecord::Base.connection.execute sql_string

... then you get an array of hashes back that you work with like this:

scoring.each do |lawyer_score|
  lawyer = User.find(lawyer_score["id"])
  score  = lawyer_score["total_score"]
  ...
end
David Aldridge
  • 51,479
  • 8
  • 68
  • 96
  • I think your analysis is spot-on and is the best bet. Go straight to the DB. – Cody Caughlan May 23 '13 at 17:12
  • Just to make sure we're on the same page I wrapped the sql statement in double quotes (because `true` was in single quotes). However, it's throwing an error `PG::Error: ERROR: syntax error at or near "." LINE 9: contacts on (contacts.user_id = user.id) ^ :` I'm not sure what the problem is since Contact.rb has a user_id:integer column – BrainLikeADullPencil May 23 '13 at 18:24
  • Should have been "contacts.user_id = users.id" – David Aldridge May 23 '13 at 18:28
  • thx, made that change, now it's saying `PG::Error: ERROR: missing FROM-clause entry for table "answer_votes" LINE 5:` I looked into this error (http://stackoverflow.com/questions/9643859/postgres-missing-from-clause-entry-error-on-query-with-with-clause) but I can't figure out where the from clause should go (and wonder why the error exists anyways, since there's already a from clause) – BrainLikeADullPencil May 23 '13 at 20:25
  • Sorry, sloppy on my part. Fixed i think – David Aldridge May 23 '13 at 20:39
  • ok, thanks, I got it to work, however, I now realize it's probably important for you to know how the data will be presented...I'm using this to sort through a whole database (assuming users sign up) of potentially thousands of users, and then to list them with pagination. Therefore, to do User.find on each of them after I sort seems like the wrong approach, and I do need the instance for each user. I'd like to sort and get the instance at the same time if possible. – BrainLikeADullPencil May 24 '13 at 02:15
  • actually, I can just get the other information I need (name, email) on the same sql query, without doing User.find again...Thanks, I learned a lot. going to try your other answer and compare speeds. – BrainLikeADullPencil May 24 '13 at 03:04
  • My thinking was that you'd only be looking for the Top-n results, but if you really do want thousands then I'd definitely go for the other solution. – David Aldridge May 24 '13 at 05:39
  • Actually, you'll notice that your sql doesn't replicate the sql from total_votes in OP (i.e. it creates a different total). In the OP, you'll notice that AnswerVote (as BestAnswer does also) joins Answer and then retrieves records where the Answer user.id = self.id `AnswerVote.joins(:answer).where(answers: {user_id: self.id}).sum('value')` The user id on AnswerVote will not be the same as user id on Answer (same with BestAnnswer). Probably a bad way to write the code (as it requires the join of Answer) but I wonder if you'd show how you'd add that to your sql statement (if you have time). – BrainLikeADullPencil May 24 '13 at 14:53
2

Do you really want to dynamically calculate a User's reputation everytime? The proper way is to pre-calculate a User's reputation. In Rails you would do that as follows:

# app/models/reputation_change_observer.rb
class ReputationChangeObserver < ActiveRecord::Observer
  observe :answer, :contribution # observe things linked to a users reputation

  def after_update(record)
    record.user.update_reputation
  end
end

# app/models/user.rb
class User
  # Add a column called "reputation"

  def update_reputation
    answerkarma = AnswerVote.joins(:answer).where(answers: {user_id: self.id}).sum('value') 
    contributionkarma = Contribution.where(user_id: self.id).sum('value')
    bestanswer = BestAnswer.joins(:answer).where(answers: {user_id: self.id}).sum('value') 
    total_votes = contributionkarma + bestanswer

    # Save the updated reputation in the "reputation" field
    self.update_attribute :reputation, total_votes
  end
end

This way, the reputation will be computed only once, and it will get stored in the database. You would then just sort using plain SQL: User.order_by(:reputation).

If your site is still growing a lot, then you two options:

  1. Wait 10-15 minutes before recalculating reputation for the same user (use a separate column called reputation_timestamp to track when the user's reputation was last calculated)

  2. Whenever the user posts an Answer/Contribution, just set a flag in the user called reputation_recalc => true. Later run a background job every 10-15 minutes, query all users who have reputation_recalc: true and calculate their reputation using the same update_reputation method.

Edit: Small comment in the code, and minor formatting, comment for user class

Subhas
  • 14,290
  • 1
  • 29
  • 37
  • Thanks for this. I'm going to play around with it and get back to you. – BrainLikeADullPencil May 26 '13 at 19:56
  • Great Answer! :) Although the Observer pattern is being removed in Rails 4. :( http://stackoverflow.com/questions/15165260/rails-observer-alternatives-for-4-0 Also using memcached to cache the array(s) might help too, as it sounds like you don't need "realtime" stats. – engineerDave May 30 '13 at 14:10
1

Take the union of your total vote queries, make it a subquery, join that to your users query. This also gives you the total_votes attribute.

def self.total_vote_sql
    "(select user_id, sum(value) as total_votes from ( " +
    [
     AnswerVote.joins(:answer).select("answers.user_id, value"),
     Contribution.select("user_id, value"),
     BestAnswer.joins(:answer).select("answers.user_id, value")
    ].map(&:to_sql) * " UNION ALL " + 
    ") as total_votes group by user_id) as tv "
end

User.select("users.*, tv.total_votes").
joins("left outer join #{User.total_vote_sql} on users.id = tv.user_id").
order("total_votes desc").lawyers_by_province(province)

(Note, I tested this on mysql, but postgres should be similar, you might need to also group by.) You may also want to benchmark this vs adding the joins to user in the subquery.

The total_vote_sql method just gets the value and user_id from each table, generates the sql on each one and then joins them with UNION.


I edited the post to get around the ambiguous column name error. It was creating a conflict with the joins in lawyers_by_province.


I also edited to resolve the ambiguous user_id between answer_votes and answers and best_answers and answers.


I added an outer subquery to the join to perform the group_by needed for the sum.

Shawn Balestracci
  • 7,380
  • 1
  • 34
  • 52
  • The union operator here will make things dreadfully slow in the long run... Especially if you join on the resulting set. – Denis de Bernardy May 17 '13 at 19:13
  • I should have specified UNION ALL to eliminate unnecessary sorting and removal of duplicates. – Shawn Balestracci May 17 '13 at 20:39
  • Thanks a lot for this. I'll try it out and get back to you once I make sure it works with postgres. – BrainLikeADullPencil May 17 '13 at 23:38
  • I'm getting an error message from your code to long for the comments section. I'll update the OP if you can help. – BrainLikeADullPencil May 23 '13 at 14:33
  • @ShawnBalestracci thanks, but it doesn't fix the problem. I'll post error in OP but I think it's basically the same. – BrainLikeADullPencil May 23 '13 at 15:58
  • @BrainLikeADullPencil oh, answers and bestanswers both have user_id, I missed that before, sorry. Edited again. – Shawn Balestracci May 23 '13 at 16:12
  • Thanks but another error. I'm so new to queries that I don't really know how fix complex ones like this is (in relation to what i'm used to) `PG::Error: ERROR: column "users.id" must appear in the GROUP BY clause or be used in an aggregate function LINE 1:` – BrainLikeADullPencil May 23 '13 at 17:57
  • edited to add the group by that postgres requires but mysql does not. – Shawn Balestracci May 23 '13 at 19:59
  • Holy moly, thanks for your patience! Still not working. this is the prob. When I use your code exactly as is, it says `undefined local variable or method `total_vote_sql' ` That's why I then changed it to `User.total_vote_sql` but when I do that, I'm still getting this `PG::Error: ERROR: column "users.id" must appear in the GROUP BY clause or be used in an aggregate function LINE 1:` – BrainLikeADullPencil May 23 '13 at 20:32
  • do you know why I might still be getting that group error even though you added the group at the end? – BrainLikeADullPencil May 24 '13 at 02:13
  • You might have to list all the users fields in that group by, it looks like you can't group by * in postgres. The other option would be to add another layer that does the group_by users.id and sum and join that to the outer query. It would take some fiddling for me to come up with the syntax for this second option, I won't be able to get to that for a few days. – Shawn Balestracci May 24 '13 at 02:33
  • @BrainLikeADullPencil I added an outer subquery to the join for the `group by` required for sum. – Shawn Balestracci May 28 '13 at 19:21
  • Thanks for all your effort, however, I still can't get it to work: `PG::Error: ERROR: column tv.total_votes does not exist` – BrainLikeADullPencil May 30 '13 at 14:14
  • @BrainLikeADullPencil I accidentally changed it from total_votes to total_value. I've corrected that. – Shawn Balestracci May 30 '13 at 21:52
1

A different approach that might work well for you is to maintain the total amounts at the User level with callbacks on the three scoring models:- answer_value, best_answer_value, and contribution_value (not nullable and default values of zero)

Although this is a potential locking problem on individual User records the voting process is likely to be fast enough that it would not be noticeable.

By maintaining separate columns for the three scores and creating an expression-based (and possibly partial) index you'd get very high performance queries for Top-n:

create index ..
on     users (
         id,
         answer_value + best_answer_value + contribution_value)
where  lawyer = 'true'
David Aldridge
  • 51,479
  • 8
  • 68
  • 96
  • just to make sure I understand, are you suggesting that any time, for example, a value is entered in best_answer_value to then do a before_save or after_save callback to the user model and add the score there? so that I'd have three more columns on the user model, namely best_answer_value, answer_value and contribution_value? – BrainLikeADullPencil May 23 '13 at 18:32
  • Yes, that's the plan. You essentially spread the pain of calculating the totals out over a great number of small transactions, and it avoids re-aggregating values that you have aggregated before. – David Aldridge May 23 '13 at 18:48
  • ok, thanks, could you show me how you'd amend the total_votes method in the User.rb model (see the OP for the current version) if I created those three columns on User.rb? Appreciate your help – BrainLikeADullPencil May 23 '13 at 20:17
  • Given that the values answerkarma, contributionkarma, and bestanswer are stored as attributes of the table, it simply becomes the sum of the three values, so delete the three lines that derive them from other models – David Aldridge May 23 '13 at 20:19
0

FOR sorting and filtering you can use gem 'wice_grid' its very easy to use and implement...wice grid.