11

I have a massive table Foo from which I need to pluck all values in a certain field, Foo.who.

The array has millions of rows, but only a few thousand different values in the who column.

If the table was smaller of course I'd simply use Foo.pluck(:who)

If I use Foo.find_in_batches do |a_batch| each set is an Array of Foo records, rather than an activerecord collection of Foo records, so I cannot use .pluck() and AFAIK the only way to extract the who column is via something like .map(&:who) that iterates over the array.

Is there a way to pluck the who column from Foo in batches that does not require then iterating over each element of each batch to extract the who column?

jpw
  • 18,697
  • 25
  • 111
  • 187
  • 2
    for rails 4+, check [this gist](https://gist.github.com/siannopollo/03d646eb7525f7fce678#file-pluck_in_batches-rb) for a `pluck_in_batches` method – allenwlee Mar 13 '15 at 21:07

3 Answers3

14

In Rails 5 you can use:

Foo.in_batches do |relation|
  values = relation.pluck(:id, :name, description)
  ...
end

Upd: for prevent memory leaks use:

Foo.uncached do
  Foo.in_batches do |relation|
    values = relation.pluck(:id, :name, description)
    ...
  end
end
Dmitry Ukolov
  • 658
  • 7
  • 10
  • 4
    Could you elaborate about memory leak issue and how/why does `Foo.uncached` help? I'm also interested if this would also apply for Rails 6.0/6.1. – GuiGS Mar 31 '21 at 08:54
  • 1
    @GuiGS, it is not a real "leak". By default Rails will cache all query results, it can take too much RAM for this sutuation. This memory will be released later. `Foo.uncached` will disable caching. See https://api.rubyonrails.org/classes/ActiveRecord/QueryCache/ClassMethods.html#method-i-uncached – Dmitry Ukolov Apr 01 '21 at 09:10
  • 7
    I see. Looks like in_batches already does this https://github.com/rails/rails/blob/85c6823b77b60f2a3a6a25d7a1013032e8c580ef/activerecord/lib/active_record/relation/batches.rb#L226 (at least for Rails >= 5.2). Not sure if it would will still apply to `relation.pluck` though. – GuiGS Apr 13 '21 at 10:06
3

Here's a method to get the ids that were retrieved by the in_batches method itself, without need to run another query yourself.

in_batches already runs pluck(:id) under the hood (when load param is false which is the default) and yield the relation object. This relation object is created with where(id: ids_from_pluck).

Is it possible to get the list of ids directly from the relation object via where_values_hash method, without the need to run another query in DB. For example:

Foo.in_batches do |relation|
  ids = relation.where_values_hash['id']
end

This should work on both Rails 5.x and 6.x, but it relies on implementation detail of in_batches so it is not guaranteed to work in future.

GuiGS
  • 2,070
  • 1
  • 19
  • 18
0

Try this:

Foo.select(:id, :who).find_in_batches do |a_batch|
  ...
end
Larry Lv
  • 759
  • 4
  • 10
  • 5
    That will still create ActiveRecord objects, which I think the OP is trying to avoid for memory and speed reasons. – mpoisot Feb 21 '18 at 21:57
  • Not sure what above comment means.. For clarity, the `select` will alter the underlying SQL to only return the relevant values. `find_in_batches` already uses AR models, the question is how to limit the data being returned. This is what I required. – Tjad Clark Oct 07 '21 at 00:35