
I'm trying to run a query of about 50,000 records using ActiveRecord's find_each method, but it seems to be ignoring my other parameters like so:

Thing.active.order("created_at DESC").limit(50000).find_each {|t| puts t.id }

Instead of stopping at 50,000 and sorting by created_at as I'd like, here's the resulting query, which gets executed over the entire dataset:

Thing Load (198.8ms)  SELECT "things".* FROM "things" WHERE "things"."active" = 't' AND ("things"."id" > 373343) ORDER BY "things"."id" ASC LIMIT 1000

Is there a way to get similar behavior to find_each but with a total max limit and respecting my sort criteria?

Avishai

14 Answers


The documentation says that find_each and find_in_batches don't retain the sort order and limit (see the sketch after this list for why) because:

  • Sorting ASC on the PK is used to make the batch ordering work.
  • Limit is used to control the batch sizes.
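
Roughly, that batching works like this under the hood (a simplified sketch for illustration, not the actual Rails source):

relation = Thing.active
records = relation.reorder(id: :asc).limit(1000).to_a # your ORDER BY and LIMIT get replaced
while records.any?
  records.each { |t| puts t.id }
  break if records.size < 1000
  # the next batch is fetched by keyset pagination on the primary key
  records = relation.reorder(id: :asc).where('id > ?', records.last.id).limit(1000).to_a
end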

You could write your own version of this function like @rorra did, but you can get into trouble when mutating the objects. If, for example, you sort by created_at and save the object, it might come up again in one of the next batches. Similarly, you might skip objects because the order of results changed while executing the query to get the next batch. Only use that solution with read-only objects.

Now my primary concern was that I didn't want to load 30,000+ objects into memory at once. My concern was not the execution time of the query itself. Therefore I used a solution that executes the original query but only caches the IDs. It then divides the array of IDs into chunks and queries/creates the objects per chunk. This way you can safely mutate the objects, because the sort order is kept in memory.

Here is a minimal example similar to what I did:

batch_size = 512
ids = Thing.order('created_at DESC').pluck(:id) # Replace the order with your own scope
ids.each_slice(batch_size) do |chunk|
  # FIELD() keeps the chunk in the cached order; Arel.sql marks the raw SQL as trusted
  Thing.where(id: chunk).order(Arel.sql("field(id, #{chunk.join(',')})")).each do |thing|
    # Do things with thing
  end
end

The trade-offs to this solution are:

  • The complete query is executed just to get the IDs
  • An array of all the IDs is kept in memory
  • It uses the MySQL-specific FIELD() function (see below for a portable alternative)
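
If you're not on MySQL, a portable alternative (my suggestion, not part of the original answer) is to re-sort each chunk in Ruby instead of in SQL:

ids.each_slice(batch_size) do |chunk|
  # Fetch the chunk unordered, then restore the cached order in Ruby
  things_by_id = Thing.where(id: chunk).index_by(&:id)
  things_by_id.values_at(*chunk).compact.each do |thing|
    # Do things with thing
  end
end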

Hope this helps!

Dirk Geurs
  • The 4.x documentation says limit is not supported. But the latest 5.x documentation does seem to honor limit. – Kirk May 11 '18 at 19:09
  • Your answer is outdated now. Both limit and order are supported since 5.x and 6.1 respectively. Please update your post. Thanks! – slhck Nov 22 '21 at 12:47

find_each uses find_in_batches under the hood.

It's not possible to select the order of the records; as described in the find_in_batches documentation, it is automatically set to ascending on the primary key (“id ASC”) to make the batch ordering work.

However, the other criteria are applied, so what you can do is:

Thing.active.find_each(batch_size: 50000) { |t| puts t.id }

Regarding the limit, it hasn't been implemented yet: https://github.com/rails/rails/pull/5696


To answer your second question, you can create the logic yourself:

total_records = 50000
batch = 1000
(0..(total_records - batch)).step(batch) do |i|
  puts Thing.active.order("created_at DESC").offset(i).limit(batch).to_sql
end
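
As several comments below point out, the range above can miss the final partial batch when total_records isn't a multiple of batch, and goes negative when total_records < batch; a safer variant (my suggestion) steps over the whole range and caps the last LIMIT:

total_records = 50000
batch = 1000
(0...total_records).step(batch) do |i|
  puts Thing.active.order("created_at DESC").offset(i).limit([batch, total_records - i].min).to_sql
end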
rorra
  • Is there a different way to achieve this? – Avishai Mar 03 '13 at 23:47
  • @jan-hettich, I wrote that *find_in_batches* doesn't support the *limit* option in my original answer; I also pointed to the pull request that implemented the option, yet it was never accepted/merged. – rorra Jun 27 '13 at 04:31
  • This solution will get you into trouble if you are mutating the objects when processing the batches. You might either skip some or have doubles if the mutation has an effect on the sort order in the database. – Dirk Geurs Nov 06 '13 at 17:31
  • @dirk-geurs Feel free to write a better solution to the question asked. – rorra Nov 06 '13 at 22:39
  • This won't work, because most of the time you want to sort on something that will change during the batch run. – Ivailo Bardarov Apr 13 '15 at 15:45
  • `total_records - batch` could be less than the `batch` size, which would be a negative range. I would call `abs` on the calculation to ensure at least one iteration of the results: e.g. `(0..(total_records - batch).abs)` – Ben Simpson May 11 '15 at 14:41
  • In order to not miss the last batch when `total_records` is not a multiple of `batch` (and even when it is), your range should be `(0..(total_records - 1))`. – Dennis Williamson Sep 14 '16 at 22:05
  • @BenSimpson is right, I used `(0..[(total_records - batch), total_records].max)`. But @Paused ... solves it better. E.g. `batch = 10`, `total_records = 35` => `steps = [0,10,20]` and 31..35 is not covered. – Foton Nov 20 '19 at 15:56
  • I guess if we decrease then we might skip extra records that are out of the max bound. Shouldn't it be (0..(records_size + batch_size)).step(batch_size)? – ilgam Mar 27 '23 at 15:08

Retrieving the ids first and processing them in_groups_of:

ordered_photo_ids = Photo.order(likes_count: :desc).pluck(:id)

ordered_photo_ids.in_groups_of(1000, false).each do |photo_ids|
  photos = Photo.order(likes_count: :desc).where(id: photo_ids)

  # ...
end

It's important to add the ORDER BY clause to the inner query as well, since WHERE id IN (...) on its own returns rows in no guaranteed order.

Thomas Klemm
  • Unlike the accepted answer, this works in PostgreSQL. Also, nice job keeping the answer succinct. – kdt Jan 06 '16 at 03:06
  • This is going to require plucking all the IDs for the table in one query, and I don't know if that's advisable for larger tables (which is where you'd be using find_in_batches anyway). – Ibrahim Dec 13 '17 at 22:23
  • Although I guess for something like this you might have to resort to getting all the IDs if you need to sort by an arbitrary column. – Ibrahim Dec 13 '17 at 22:29
  • @Darme - it really shouldn't be, since it pulls all IDs from the table into RAM. This is what the original question wants to avoid. There's a gigantic (unbounded) database query and unbounded RAM requirements for the ID array in Ruby, then it all has to be marshalled into the mother of all SQL queries when that ID list is quoted back to the database via `WHERE...IN`. It looks elegant, but that's the risk sometimes with Ruby - you don't realise the impact. It's not as bad as full records, but even IDs will be heavy going with large tables. – Andrew Hodgkinson Sep 16 '19 at 22:42
  • @AndrewHodgkinson you're right: it seemed to work at first, but after a while I ran into this specific issue. – Darme Sep 24 '19 at 07:21
  • If you're using MySQL, the IDs are already in memory.. how can you fetch the IDs in batches from SQL, without having to store all IDs in memory at once? – Tilo May 04 '20 at 22:29

Rails 6.1 adds support for descending order in find_each, find_in_batches and in_batches.
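
For example (note this orders on the primary key descending; a custom ORDER BY column is still not supported):

Thing.active.find_each(order: :desc) do |t|
  puts t.id
end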

rassom

One option is to put an implementation tailored to your particular model into the model itself (speaking of which, id is usually a better choice for ordering records, since created_at may have duplicates):

class Thing < ActiveRecord::Base
  def self.find_each_desc(limit)
    batch_size = 1000
    i = 1
    records = self.order(created_at: :desc).limit(batch_size)
    while records.any?
      records.each do |task|
        yield task, i
        i += 1
        return if i > limit
      end
      # Note: paginating on id while sorting by created_at assumes ids increase with created_at
      records = self.order(created_at: :desc).where('id < ?', records.last.id).limit(batch_size)
    end
  end
end

Or else you can generalize things a bit, and make it work for all the models:

lib/active_record_extensions.rb:

ActiveRecord::Batches.module_eval do
  def find_each_desc(limit)
    batch_size = 1000
    i = 1
    records = self.order(id: :desc).limit(batch_size)
    while records.any?
      records.each do |task|
        yield task, i
        i += 1
        return if i > limit
      end
      records = self.order(id: :desc).where('id < ?', records.last.id).limit(batch_size)
    end
  end
end

ActiveRecord::Querying.module_eval do
  delegate :find_each_desc, :to => :all
end

config/initializers/extensions.rb:

require "active_record_extensions"

P.S. I'm putting the code in files according to this answer.

x-yuri

You can iterate backwards with standard Ruby iterators:

Thing.last.id.step(0, -1000) do |i|
  Thing.where(id: (i-1000+1)..i).order('id DESC').each do |thing|
    #...
  end
end

Note: the +1 is because the BETWEEN that will appear in the query includes both bounds, but we need to include only one.

Sure, with this approach a batch could fetch fewer than 1000 records, because some of them have already been deleted, but this is OK in my case.

Lev Lukomsky

As remarked by @Kirk in one of the comments, find_each supports limit as of version 5.1.0.

Example from the changelog:

Post.limit(10_000).find_each do |post|
  # ...
end

The documentation says:

Limits are honored, and if present there is no requirement for the batch size: it can be less than, equal to, or greater than the limit.

(setting a custom order is still not supported though)

tsauerwein

I was looking for the same behaviour and thought up this solution. This DOES NOT order by created_at, but I thought I would post it anyway.

max_records_to_retrieve = 50000
last_index = Thing.count
start_index = [(last_index - max_records_to_retrieve), 0].max
# Note: :start expects an id, so this assumes ids run contiguously from 1
Thing.active.find_each(start: start_index) do |u|
  # do stuff
end

Drawbacks of this approach:

  • You need 2 queries (the first one should be fast)
  • This guarantees a max of 50K records, but if ids are skipped you will get fewer

Moemars
  • Since I found this while looking for skip + find_each, it's worth mentioning here: the :start option can be used as an equivalent to skip() that you might use otherwise. – Yourpalal Mar 09 '16 at 16:36

You can try the ar-as-batches gem.

From their documentation, you can do something like this:

Users.where(country_id: 44).order(:joined_at).offset(200).as_batches do |user|
  user.party_all_night!
end
Martin

With Kaminari or something similar, this becomes easy.

Create a batch loader class.

module BatchLoader
  extend ActiveSupport::Concern

  def batch_by_page(options = {})
    options = init_batch_options!(options)

    next_page = 1

    loop do
      next_page = yield(next_page, options[:batch_size])

      break next_page if next_page.nil?
    end
  end

  private

  def default_batch_options
    {
      batch_size: 50
    }
  end

  def init_batch_options!(options)
    options ||= {}
    default_batch_options.merge!(options)
  end
end

Create Repository

class ThingRepository
  include BatchLoader

  # @param [Integer] per_page
  # @param [Proc] block
  def batch_changes(per_page=100, &block)
    relation = Thing.active.order("created_at DESC")

    batch_by_page do |next_page|
      query = relation.page(next_page).per(per_page)
      yield query if block_given?
      query.next_page
    end
  end
end

Use the repository

repo = ThingRepository.new
repo.batch_changes(5000) do |g|
  g.each do |t|
    #...
  end
end
merqlove

Adding find_in_batches_with_order solved my use case, where I already had the ids but needed batching and ordering. It was inspired by @dirk-geurs's solution.

# Create file config/initializers/find_in_batches_with_order.rb with the following code.
ActiveRecord::Batches.class_eval do
  ## Only flat order structure is supported now
  ## example: [:forename, :surname] is supported but [:forename, {surname: :asc}] is not supported
  def find_in_batches_with_order(ids: nil, order: [], batch_size: 1000)
    relation = self
    arrangement = order.dup
    index = order.find_index(:id)

    unless index
      arrangement.push(:id)
      index = arrangement.length - 1
    end

    # pluck returns flat values for a single column and arrays otherwise
    ids ||= relation.order(*arrangement).pluck(*arrangement).map do |tuple|
      arrangement.length == 1 ? tuple : tuple[index]
    end
    ids.each_slice(batch_size) do |chunk_ids|
      chunk_relation = relation.where(id: chunk_ids).order(*order)
      yield(chunk_relation)
    end
  end
end
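
Usage then looks something like this (hypothetical example):

Thing.active.find_in_batches_with_order(order: [:created_at], batch_size: 500) do |batch|
  batch.each { |thing| puts thing.id }
end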

Leaving Gist here https://gist.github.com/the-spectator/28b1176f98cc2f66e870755bb2334545

the_spectator

I had the same problem with a query using DISTINCT ON, where you need an ORDER BY on that field, so this is my approach with Postgres:

def filtered_model_ids
  Model.joins(:father_model)
       .select('DISTINCT ON (model.field) model.id')
       .order(:field)
       .map(&:id)
end

def processor
  filtered_model_ids.each_slice(BATCH_SIZE).lazy.each do |batch|
    Model.find(batch).each do |record|
      # Code
    end
  end
end

Here is my code:

batch_size = 100
total_count = klass.count
offset = 0
processed_count = 0
while processed_count < total_count
  relation = klass.order(active_at: :asc, created_at: :desc).offset(offset).limit(batch_size)
  relation.each do |record|
    record.process
  end
  # advance the window; without this, the same first batch would be processed repeatedly
  offset += batch_size
  processed_count += batch_size
end
artamonovdev

Do it in one query and avoid iterating:

User.offset(2).order('name DESC').last(3)

will produce a query like this:

SELECT "users".* FROM "users" ORDER BY name ASC LIMIT $1 OFFSET $2 [["LIMIT", 3], ["OFFSET", 2]

Max