ActiveRecord: load a corresponding array of records from an array of primary keys (preserve order, duplicates, maximize performance)

Question

(Was: Reverse Eager Loading in ActiveRecord)

I have this weird problem, where I know I need to use eager loading, but since this is such a weird use case, it doesn't work that well.

The Code:

class Task < ActiveRecord::Base
  belongs_to :project
end

class Project < ActiveRecord::Base
  has_many :tasks
end

The Problem:

I know that in the traditional setting, where you have a Project and want to render its tasks, you use eager loading to load the tasks once instead of fetching them one at a time. In my case, however, I have a list of tasks, and for each task I need to fetch the corresponding project. When rendering sequentially, the Rails SQL cache helps, but I have a lot of tasks, so I end up loading the same Project over and over again.

What can I do to avoid this messy situation?

Edit:

I'm trying to clarify the situation. I have multiple arrays of task ids, e.g.

type_a_tasks = [1,2,3,1,2,3]
type_b_tasks = [1,2,2,3,3]

Note that the same task can appear more than once. Now I want to map the lists, as in functional programming, so that instead of the ids I get the actual tasks, with their associations loaded:

type_a_tasks = [Task #1, Task #2, etc.]
type_b_tasks = [Task #1, Task #2, etc.]

I know I could just get the tasks by

Task.includes(:project).find(type_a_tasks + type_b_tasks)

but then I reduce it to the Set of tasks and lose the order of my collections. Is that clearer?

I don't quite understand - it would seem the solution is to simply eager load the projects for the tasks? I.e., Task.includes(:project) ..? Perhaps more detail is required in the question. — jstephenson, Nov 26 '12 at 00:00
@vladr I suspect if that is the case, such a caching strategy might prove useful for the Tasks too, so a winner all around. — jstephenson, Nov 26 '12 at 00:18
@jstephenson perhaps he does not have all tasks available upfront. @nambrot IIRC [the query cache](http://railspikes.com/2008/8/18/disabling-activerecord-query-caching-when-needed) works even when the repeating SQL queries do not occur sequentially (have you tried explicitly wrapping it all with `Project.cache { ... }`?). If no luck then you also have the option of a [global `Project` cache](http://stackoverflow.com/questions/3860472/how-can-i-cache-model-objects-in-rails), but you should absolutely try to get the `ActiveRecord` caching working first. — vladr, Nov 26 '12 at 00:22
Hey guys, thanks for the suggestions. @jstephenson: I get the tasks from somewhere else as a huge list, so effectively I need to load the associations somehow apart from the eager load, right? @vladr: query cache works, but I have a lot of tasks, so still not enough. I wanted to avoid a global cache, but I know that works as well. I just thought there'd be a more railsy way of doing it — nambrot, Nov 26 '12 at 00:42
@nambrot if you already have the list of tasks why does loading them with `:include => :project` not do what you want, i.e. load all projects whose IDs are all distinct `project_id`s in the given tasks? What does the SQL log look like with `:include => :project`, and what would you like to see changed therein? — vladr, Nov 26 '12 at 01:09
@vladr thanks again for the quick response. Basically, I have multiple ordered arrays of task IDs, to which I need to load the tasks and their associations to. Afaik, if I use the AR methods (find) I lose my order in my arrays since AR just retrieves the record, but leaves me with the work in building up the arrays again? — nambrot, Nov 26 '12 at 12:42
@nambrot yes do this as a two-pass, `find` them *all* first (with `:include => :project`) then re-order the records based on the ID ordering in the original arrays, or else your get-each-task DB roundtrips will kill your performance. You can create a separate question on how to achieve this reordering, but it's very simple to do in Ruby (call `find` with your ID array to get an AR array, then make your ID array into a map of ID => position-in-ID-array, then `sort` your AR array with a function that simply compares `map[AR1.id]` to `map[AR2.id]` and presto, your AR array is sorted) — vladr, Nov 26 '12 at 20:48
@vladr that's what I ended up doing, but I really wished there was a railsy way of doing it. I think there is generally not a whole lot of "functional programming" type of approaches in ActiveRecord. Also, could you add this as an answer, so that I can choose yours? — nambrot, Nov 26 '12 at 23:21

score 2 · Accepted Answer · edited May 23 '17 at 10:24

Let's start with the most obvious approach first:

type_a_task_ids = [1,2,3,1,2,3]
type_b_task_ids = [1,2,2,3,3] 
type_a_tasks = type_a_task_ids.map { |task_id| Task.includes(:project).find(task_id) }
type_b_tasks = type_b_task_ids.map { |task_id| Task.includes(:project).find(task_id) }

The above is simple, readable but potentially slow: it will perform one database round-trip for each distinct task_id as well as one database round-trip for each distinct project_id in the given tasks. All the latency adds up, so you want to load the tasks (and corresponding projects) in bulk.

It would be great if you could have Rails bulk-load (prefetch) and cache those same records upfront in, say, two round-trips (one for all distinct tasks and one for all distinct associated projects), and then just have the exact same code as above -- except find would always hit the cache instead of the database.

Unfortunately things don't quite work that way (by default) in Rails, because ActiveRecord's built-in cache is a query cache: it caches result sets keyed by the SQL statement, not individual records. Running Task.find(1) (SELECT * FROM tasks WHERE id=1) after Task.find([1,2,3]) (SELECT * FROM tasks WHERE id IN (1,2,3)) will not leverage the query cache, since the two queries differ. (Running Task.find(1) a second, third, etc. time will leverage the query cache, though, as Rails will see the exact same SELECT query fly by multiple times and return the cached result set.)

Enter IdentityMap caching. Identity Map Caching is different in the sense that it caches records, not queries, on a per-table-and-primary-key basis. Thus, running Task.find([1,2,3]) would fill out three records in the Identity Map Cache for table tasks (the entries with IDs 1, 2 and 3 respectively), and a subsequent Task.find(1) would promptly return the cached record for table tasks and ID 1.
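
To make the difference between the two kinds of cache concrete, here is a toy model in plain Ruby. It is only a sketch: ToyQueryCache, ToyIdentityMap and FAKE_TABLE are hypothetical names invented for illustration and do not reflect how Rails implements either cache. Each class simply counts how many simulated database round-trips it makes.

```ruby
# Toy model only: these classes are hypothetical and are NOT Rails internals.
FAKE_TABLE = { 1 => "Task #1", 2 => "Task #2", 3 => "Task #3" }.freeze

# Caches whole result sets, keyed by the exact SQL text.
class ToyQueryCache
  attr_reader :round_trips

  def initialize
    @results = {}
    @round_trips = 0
  end

  def select(sql, ids)
    @results.fetch(sql) do
      @round_trips += 1 # cache miss: simulate a database round-trip
      @results[sql] = FAKE_TABLE.values_at(*ids)
    end
  end
end

# Caches individual records, keyed by primary key.
class ToyIdentityMap
  attr_reader :round_trips

  def initialize
    @records = {}
    @round_trips = 0
  end

  def find(ids)
    missing = ids.uniq.reject { |id| @records.key?(id) }
    unless missing.empty?
      @round_trips += 1 # one simulated bulk query for all missing ids
      missing.each { |id| @records[id] = FAKE_TABLE[id] }
    end
    @records.values_at(*ids)
  end
end

query_cache = ToyQueryCache.new
query_cache.select("SELECT * FROM tasks WHERE id IN (1,2,3)", [1, 2, 3])
query_cache.select("SELECT * FROM tasks WHERE id=1", [1]) # different SQL text: miss
query_cache.select("SELECT * FROM tasks WHERE id=1", [1]) # same SQL text: hit

identity_map = ToyIdentityMap.new
identity_map.find([1, 2, 3])
identity_map.find([1]) # record 1 is already cached: no round-trip
```

In this toy, the query cache still pays a second round-trip for the single-id lookup, while the identity map answers it from the records the bulk load already stored.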

# with IdentityMap turned on (see IdentityMap documentation)
type_a_task_ids = [1,2,3,1,2,3]
type_b_task_ids = [1,2,2,3,3]
# prefetch all distinct tasks and their associated projects
# (| is array union; & would only prefetch the ids present in both lists)
# throw away the result, we only want to prep the cache
Task.includes(:project).find(type_a_task_ids | type_b_task_ids)
# proceed with regular logic
type_a_tasks = type_a_task_ids.map { |task_id| Task.includes(:project).find(task_id) }
type_b_tasks = type_b_task_ids.map { |task_id| Task.includes(:project).find(task_id) }

However, IdentityMap has never been active by default (for good reason), and was ultimately removed from Rails.

How do you achieve the same result without IdentityMap? Simple:

type_a_task_ids = [1,2,3,1,2,3]
type_b_task_ids = [1,2,2,3,3]
# prefetch all distinct tasks and their associated projects
# (| is array union; & would only prefetch the ids present in both lists)
# store the result in our own identity cache, keyed by task id
my_tasks_identity_map = \
  Hash[Task.includes(:project).find(type_a_task_ids | type_b_task_ids).map { |task|
    [ task.id, task ]
  }]
# proceed with cache-centric logic: order and duplicates come from the id arrays
type_a_tasks = type_a_task_ids.map { |task_id| my_tasks_identity_map[task_id] }
type_b_tasks = type_b_task_ids.map { |task_id| my_tasks_identity_map[task_id] }
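
The same idea can be pulled into a small reusable helper. The following is a minimal sketch in plain Ruby so that it runs without Rails: Record is a hypothetical stand-in for an ActiveRecord model, and records_in_order is a name made up for this example, not an ActiveRecord method.

```ruby
# Hypothetical stand-in for an ActiveRecord model, so the sketch runs without Rails.
Record = Struct.new(:id, :name)

# Maps each id to its record, preserving the order and duplicates of `ids`.
# `records` is assumed to be the result of one bulk load of the distinct ids.
def records_in_order(ids, records)
  by_id = records.each_with_object({}) { |record, map| map[record.id] = record }
  ids.map { |id| by_id.fetch(id) } # raises KeyError if an id was not loaded
end

loaded = [Record.new(3, "c"), Record.new(1, "a"), Record.new(2, "b")] # arbitrary DB order
type_a = records_in_order([1, 2, 3, 1, 2, 3], loaded)
type_b = records_in_order([1, 2, 2, 3, 3], loaded)
```

In a Rails app, `loaded` would presumably come from a single bulk query such as Task.includes(:project).where(id: all_ids.uniq).to_a. Using fetch rather than [] is a design choice here: a missing id fails loudly instead of silently putting nil into the result.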

score 0 · Answer 2 · answered Nov 26 '12 at 06:06


I think I see your problem, which is that if you have a bunch of Tasks that all belong to the same project, you will be loading that project multiple times.

Assuming you already have an array of the Task objects, how about this?

project_ids = @tasks.map{|task| task.project_id}.uniq
@projects = Project.find(project_ids)
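
To get from the bulk-loaded projects back to each task, one option is to index the projects by id and look each one up by the task's project_id. This is only a sketch in plain Ruby: FakeTask and FakeProject are hypothetical stand-ins for the real models so that it runs without Rails.

```ruby
# Hypothetical stand-ins for the Task and Project models, so the sketch runs without Rails.
FakeProject = Struct.new(:id, :name)
FakeTask = Struct.new(:id, :project_id)

tasks = [FakeTask.new(1, 10), FakeTask.new(2, 10), FakeTask.new(3, 20)]
projects = [FakeProject.new(10, "Alpha"), FakeProject.new(20, "Beta")] # one bulk load

projects_by_id = projects.each_with_object({}) { |project, map| map[project.id] = project }

# Pair each task with its project; tasks sharing a project share one object.
tasks_with_projects = tasks.map { |task| [task, projects_by_id[task.project_id]] }
```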


vpsz


Thanks for your answer. I have tried to augment my question. The essence is that I need the project objects to be set on their task objects for further processing. – nambrot Nov 26 '12 at 12:52

score 0 · Answer 3 · answered Nov 27 '12 at 17:44


If you enable the IdentityMap in Rails via a line like this in config/application.rb:

config.active_record.identity_map = true

Then ActiveRecord will not in fact go back to the DB to load a Project it has already loaded before - it will just reference that same object in memory.


Cody Caughlan


`IdentityMap` is being removed from Rails -- did you read the other answer before posting? – vladr Nov 28 '12 at 00:57
@vladr if you read through that pull request, at the end they come to the conclusion that it will NOT be removed and it's up to further discussion. And sorry, I did scan the other answers but I missed your reference to IdentityMap. – Cody Caughlan Nov 28 '12 at 18:55
Yes, I actually *did* read through that pull request, to the end, where **the removal is [merged into rails/master](https://github.com/rails/rails/commit/795062282e072f289918688e978a0cf24e6d3aa5)**, 9 months ago. :) [It really is gone from `activerecord/lib/active_record`](https://github.com/rails/rails/tree/master/activerecord/lib/active_record). – vladr Nov 29 '12 at 00:08
@vladr I see, yes, you're right. I read through that pull request and it appeared it wasn't removed, hence I was confused. Well, that's too bad, I loved the IM feature and never experienced any issues with it, and I have a couple of pretty complex apps in production. Oh well, guess we have more round-trips to the database. – Cody Caughlan Nov 29 '12 at 01:22
