4

I'm struggling how to have Ruby on Rails do this query right... in short: to join on a has_many relation but only via the most recent record in that relation and then can apply a filter/select on that relation.

Here's a super simple variant that captures my struggle:


Let's say I have a table of Employees and a table of Employments. An employee has_many employments. An employment has a status of :active or :inactive.

class Employee < ActiveRecord::Base
  has_many :employments
end

class Employment < ActiveRecord::Base
  belongs_to :employee
end

To keep it simple, let's say there is one employee: Dan and he has two employments: an old one (by created_at) that is :inactive and a new one that is :active.

dan = Employee.create(name: 'Dan')
Employment.create(employee: dan, created_at: 2.years.ago, status: :inactive)
Employment.create(employee: dan, created_at: 3.months.ago, status: :active)

So in effect, you could say: "Dan has worked twice and is currently actively employed."

What I want is the Rails query for saying: "find me the employees who are inactive". And that should return an empty set because Dan's latest employment is :active. So I can't just do: Employee.joins(:employments).where(employments: { status: :inactive }) because it would match the old employment and thus return the Dan employee record.

I need a way to say: "find the employees who are inactive based on the most recent employment record only".

But I don't know how to do that in Rails.

I feel like I'm missing something... that it should be quite simple... but I can't figure it out.

Thanks!

Dan Sharp
  • 1,209
  • 2
  • 12
  • 31
  • its is possible for old record has active and new record is inactive? – Giridharan Mar 05 '20 at 18:38
  • Please read "[ask]" and the linked pages, "[Stack Overflow question checklist](https://meta.stackoverflow.com/questions/260648)" and "[Writing The Perfect Question](https://codeblog.jonskeet.uk/2010/08/29/writing-the-perfect-question/)". Show us some code rather than try to explain it. The act of putting it into the most simple example that demonstrates the problem will help you. As is your question is broad and needs to be more concise. – the Tin Man Mar 05 '20 at 18:51
  • @giridharan yes. In that case the employee would be considered inactive. In effect, the _most recent_ employment record's status is the one that matters for this question. – Dan Sharp Mar 05 '20 at 19:10
  • @theTinMan thanks for the comment. I didn't think my question was too broad. I can add some code... but it'll really just be the models and their relations. But I'll do that if that will help conceptualize the setup – Dan Sharp Mar 05 '20 at 19:12
  • I think what I need is some way to make a temporary table (or inner query) that acts as if the `employments` records ONLY include the most recent. Thus, in my example above, it would only include the 2nd employment record (the one that's `:active`). Then, the query of `employees` that joins on that `where(status: :inactive)` wouldn't find any. – Dan Sharp Mar 05 '20 at 20:50
  • Before understanding how to do this in Rails, you might first want to find out how to build the query at all. This type of query is harder than you might think. You might want to check out [Fetch the row which has the Max value for a column](https://stackoverflow.com/questions/121387/fetch-the-row-which-has-the-max-value-for-a-column) to get a feeling for where the difficulty lies. – 3limin4t0r Mar 05 '20 at 23:15
  • Is `status` an enum, btw? – Ja͢ck Mar 05 '20 at 23:16
  • Jack, yes, but it could just as easily be a boolean or whatever. It's a filter value, essentially. – Dan Sharp Mar 06 '20 at 16:33

7 Answers7

3

I struggled quite a bit with the exact same issue in an application with a huge amount of rows and after trying various novel solutions like lateral joins and subqueries the best performing and by far simplest solution was just to add a foreign key to the table that points to the latest row and use an association callback (or a db trigger) to set the foreign key.

class AddLatestEmploymentToEmployees < ActiveRecord::Migration[6.0]
  def change
    add_reference :employees, :latest_employment, foreign_key: { to_table: :employments }
  end
end

class Employee < ActiveRecord::Base
  has_many :employments, after_add: :set_latest_employment
  belongs_to :latest_employment, 
    class_name: 'Employment',
    optional: true

  private
  def set_latest_employment(employment)
    update_column(:latest_employment_id, employment.id)
  end 
end

Employee.joins(:latest_employment)
        .where(employments: { status: :active })

It really shines if the amount of associated records is huge like it was in my case as you can eager load the latest record without the memory issues which occur if you load the entire has_many association.

max
  • 96,212
  • 14
  • 104
  • 165
  • I really like this suggestion! I'll have to think about the ramifications of altering the Employee table, but I definitely see the value in having it point to the latest. – Dan Sharp Mar 06 '20 at 16:18
  • If you want to see the lateral join mentioned you can find it here https://github.com/maxcal/remote-wind/blob/d78f0d847b63c44f705fd080dae31d204b6a78a1/app/models/observation.rb#L68 – max Mar 06 '20 at 16:21
  • Thanks Max! That seems a bit too much SQL black magic to me! I haven't heard of a lateral join. Very cool! – Dan Sharp Mar 06 '20 at 16:31
3

One alternative is to use a LATERAL JOIN which is a Postgres 9.3+ specific feature which can be described as something like a SQL foreach loop.

class Employee < ApplicationRecord
  has_many :employments
  def self.in_active_employment
    lat_query = Employment.select(:status)
                      .where('employee_id = employees.id') # lateral reference
                      .order(created_at: :desc)
                      .limit(1)
    joins("JOIN LATERAL(#{lat_query.to_sql}) ce ON true")
      .where(ce: { status: 'active' })
  end
end

This fetches the latest row from employments and then uses this in the WHERE clause to filter the rows from employees.

SELECT "employees".* FROM "employees" 
JOIN LATERAL(
  SELECT "employments"."status" 
  FROM "employments" 
  WHERE (employee_id = employees.id) 
  ORDER BY "employments"."created_at" DESC 
  LIMIT 1
) ce  ON true 
WHERE "ce"."status" = $1 LIMIT $2 

This is going to be extremely fast in comparison to a WHERE id IN subquery if the data set is large. Of course the cost is limited portability.

max
  • 96,212
  • 14
  • 104
  • 165
  • Lateral joins are a part of the SQL specs and exist on other RDBMS such as MySQL. There are however noticable differences in the syntax. – max Feb 04 '23 at 09:44
1

The simplest solution (based on code complexity) I can think of is first fetching the employment ids with their maximum values, then compsing a new query with the result.

attributes = %i[employee_id created_at]
employments = Employment.group(:employee_id).maximum(:created_at)
              .map { |values| Employee.where(attributes.zip(values).to_h) }
              .reduce(Employment.none, :or)
              .where(status: :inactive)

employees = Employee.where(id: employments.select(:employee_id))

This should produce the following SQL:

SELECT employments.employee_id, MAX(employments.created_at)
FROM employments
GROUP BY employments.employee_id

With the result the following query is build:

SELECT employees.*
FROM employees
WHERE employees.id IN (
  SELECT employments.employee_id 
  FROM employments
  WHERE (
    employments.employee_id = ? AND employments.created_at = ?
    OR employments.employee_id = ? AND employments.created_at = ?
    OR employments.employee_id = ? AND employments.created_at = ?
    -- ...
  ) AND employments.status = 'inactive'
)

The above method doesn't hold up well for large amounts of records, since the query grows for each additional employee. It becomes a lot easier when we can assume the higher id is made last. In that scenario the following would do the trick:

employment_ids = Employment.select(Employment.arel_table[:id].maxiumum).group(:employee_id)
employee_ids = Employment.select(:employee_id).where(id: employment_ids, status: :inactive)
employees = Employee.where(id: employee_ids)

This should produce a single query when employees is loaded.

SELECT employees.*
FROM employees
WHERE employees.id IN (
  SELECT employments.employee_id 
  FROM employments
  WHERE employments.id IN (
    SELECT MAX(employments.id)
    FROM employments
    GROUP BY employments.employee_id
  ) AND employments.status = 'inactive'
)

This solution works a lot better with larger datasets but you might want to look into the answer of max for better lookup performance.

3limin4t0r
  • 19,353
  • 2
  • 31
  • 52
  • Nice! I just posted my own answer after poking at this a bit and realized it's really similar to what you posted here... although I like your answer a lot more in terms of more Ruby and less SQL in the code. Thanks! – Dan Sharp Mar 06 '20 at 16:28
  • also "It becomes a lot easier when we can assume the higher id is made last." <- that's a good point and likely something that I can count on being always true, thus that maxes it easier than finding the "latest" by a date – Dan Sharp Mar 06 '20 at 16:29
1

+1 to @max's answer.

An alternative though is to add a start_date and end_date attribute to Employment. To get active employees, you can do

Employee
  .joins(:employments)
  .where('end_date is NULL OR ? BETWEEN start_date AND end_date', Date.today)
jvnill
  • 29,479
  • 4
  • 83
  • 86
1

After fiddling for a while (and trying all these suggestions you all came up with, plus some others), I came up with this. It works, but maybe isn't the most elegant.

inner_query = Employment.select('distinct on(employee_id) *').order('employee_id').order('created_at DESC')
employee_ids = Employee.from("(#{inner_query.to_sql}) as unique_employments").select("unique_employments.employee_id").where("unique_employments.status='inactive'")
employees = Employee.where(id: employee_ids)

The inner query returns a collection of unique employments... the latest for each employee. Then based on that I pull the employee IDs that match the status. And last, find those employee records from the IDs

I don't love it, but it's understandable and does work.

I really appreciate all the input.

One big take-away for me (and anyone else who lands across this same/similar problem): max's answer helped me realize the struggle I was having with this code is a "smell" that the data isn't modeled in an ideal way. Per max's suggestion, if the Employee table has a reference to the latest Employment, and that's kept up-to-date and accurate, then this becomes trivially easy and fast.

Food for thought.

Dan Sharp
  • 1,209
  • 2
  • 12
  • 31
0

Since the title includes ARel. The following should work for your example:

employees = Employee.arel_table
employments = Employment.arel_table
max_employments = Arel::Table.new('max_employments')
e2 = employments.project(
      employments['employee_id'], 
      employments['id'].maximum.as('max_id')
     ).group(employments['employee_id'])
me_alias = Arel::Nodes::As.new(e2,max_employments)

res = employees.project(Arel.star)
      .join(me_alias).on(max_employments['employee_id'].eq(employees['id'])).
      .join(employments).on(employments['id'].eq(max_employments['max_id']))


Employee.joins(*res.join_sources)
  .where(employments: {status: :inactive})

This should result in the following

SELECT employees.* 
FROM employees 
INNER JOIN (
    SELECT 
       employments.employee_id, 
       MAX(employments.id) AS max_id 
    FROM employments 
    GROUP BY employments.employee_id
    ) AS max_employments ON max_employments.employee_id = employees.id 
INNER JOIN employments ON employments.id = max_employments.max_id
WHERE 
  employments.status = 'inactive'
engineersmnky
  • 25,495
  • 2
  • 36
  • 52
  • I tried this but got: `TypeError (Cannot visit Arel::SelectManager)` Not sure why, though. The SQL makes sense and even though complicated does seem to represent what I'm looking for – Dan Sharp Mar 05 '20 at 20:45
0

In my opinion you can get those max dates first to sure not getting old records and then just filter for the required status. Here was the example of doing first part of it

https://stackoverflow.com/a/18222124/10057981

  • Victor! that's a really great link... I appreciate seeing how the subselect is extremely non-performant because it's run each time (for the outer select's records). – Dan Sharp Mar 06 '20 at 16:44
  • Yep, sub queries are always tricky. Not like it’s the topic case, but there was even an recommendation to prefer using joins instead of subqueries if possible in some old MySQL docs - one should be very accurate when choosing between joins/subqueries in each specific scenario. – Victor Shinkevich Mar 07 '20 at 02:17