I have a Ruby on Rails project with millions of products, each with its own URL. A method "test_response" checks the URL and sets the Product attribute marked_as_broken to either true or false; either way, the Product is saved, which updates its "updated_at" attribute to the current timestamp.

Since this is a very slow process, I have created a task which in turn starts 15 tasks, each with N/15 of the products to check. The first one should check, for example, the 1st to the 10,000th product, the second one the 10,000th to the 20,000th, and so on, using limit and offset.

The script starts the 15 processes fine, but each process completes far too early, one after another. They do not crash; each simply finishes with "Process exited with status 0".

My guess is that combining find_each with a filter on updated_at, while the script itself updates "updated_at" as it runs, changes the result set underneath me, so the script does not go through the 10,000 items as intended, but I can't verify this.

Is there something inherently wrong with what I am doing here? For example, does "find_each" re-run the SQL query every so often, returning completely different results each time than anticipated? I expect it to return the same rows 10,000 through 20,000, just split up into batches.

task :big_response_launcher => :environment do
  nbr_of_fps = Product.where(:marked_as_broken => false).where("updated_at < ?", 1.year.ago.to_date).size
  nbr_of_processes = 15
  batch_size = (nbr_of_fps / nbr_of_processes) - 2
  heroku = PlatformAPI.connect_oauth(auth_code_provided_elsewhere)
  (0..nbr_of_processes - 1).each do |i|
    puts "Launching #{i}"
    current_offset = batch_size * i
    puts "rake big_response_tester[#{current_offset},#{batch_size}]"
    heroku.dyno.create('kopa', {
      :command => "rake big_response_tester[#{current_offset},#{batch_size}]",
      :attach => false
    })
  end
end

task :big_response_tester, [:current_offset, :batch_size] => :environment do |task, args|
  current_limit = args[:batch_size].to_i
  current_offset = args[:current_offset].to_i
  puts "Launching with offset #{current_offset} and limit #{current_limit}"
  Product.where(:marked_as_broken => false).where("updated_at < ?", 1.year.ago.to_date).limit(current_limit).offset(current_offset).find_each do |fp|
    fp.test_response
  end
end
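For what it's worth, the interaction I suspect can be simulated without a database. In this plain-Ruby sketch (all names illustrative), the "scope" filters on a flag that the workers themselves flip, the same way my query filters on the updated_at column that test_response bumps:

```ruby
# Simulation: offset pagination over a scope that is re-evaluated while
# workers update the very column it filters on. A fixed offset then
# points at different rows each time the scope is queried.
records = (1..20).map { |id| { id: id, stale: true } }
scope   = -> { records.select { |r| r[:stale] } }  # like "updated_at < 1 year ago"

# Worker A processes its batch (offset 0, limit 5) and "touches" each row,
# which removes the row from the scope (as saving bumps updated_at).
scope.call.first(5).each { |r| r[:stale] = false }

# Worker B was assigned offset 5, but the scope has shrunk underneath it:
batch_b = scope.call.drop(5).first(5)
batch_b.map { |r| r[:id] }  # => [11, 12, 13, 14, 15], not the expected 6..10
```

Rows 6 through 10 are silently skipped, which would explain workers finishing early with status 0.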
Christoffer
  • Consider using keyset pagination for this task. [This post](https://www.citusdata.com/blog/2016/03/30/five-ways-to-paginate/) provides a great overview. – moveson Aug 26 '19 at 13:18
  • find_each does run a new query using the last record id and limit after every batch https://apidock.com/rails/v2.3.8/ActiveRecord/Batches/ClassMethods/find_in_batches. If you modify records on another task, the next iteration of the find_each loop could read different records since the query is executed again. – arieljuod Aug 26 '19 at 17:52
  • Thanks, moveson and arieljuod! I will check out keyset pagination and investigate find_each further. – Christoffer Aug 27 '19 at 04:17
  • `find_each` ignores limit as it runs `find_in_batches` under the hood with a default limit of 1000 – Int'l Man Of Coding Mystery Aug 27 '19 at 07:00
  • Thanks again. Mike Heft: Does that mean that both find_each and find_in_batches will in a way ignore a Product.limit(1000).offset(500)? If I run a script that will start off sections/offsets at 1000, 2000, 3000, 4000 with a limit of 1000 each, will that cause the troubles I have, you think? – Christoffer Aug 27 '19 at 11:45
  • @Christoffer ya, it will run with a limit of 1000 and offsets every 1000, so 1k, 2k, 3k... That's why they suggest using `#each` if your dataset is small. Sorry, I may have misunderstood. find_each will ignore your limits and offsets and use the ones it has defined internally, AFAIK. If you need your own limit/offset, then each might be better – Int'l Man Of Coding Mystery Aug 27 '19 at 11:47
  • So, maybe I should have asked the question on SO instead like: "How do I run 15 PARALLEL one-off processes, each going through a 15th of the table/data, performing a url-tester ".test_response" on each". – Christoffer Aug 27 '19 at 14:09

1 Answer


As many have noted in the comments, it seems that find_each ignores the order and limit. I found this answer (ActiveRecord find_each combined with limit and order) that seems to work for me. It's not working 100%, but it is a definite improvement. The rest seems to be a memory issue: I cannot have too many processes running at the same time on Heroku.
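The key idea in the linked answer is to fix the id boundaries up front and page inside each slice by last-seen id (keyset pagination) instead of offset, so records whose updated_at changes mid-run can never shift between workers. A minimal plain-Ruby sketch of that idea (no ActiveRecord; all names illustrative):

```ruby
# Sketch: split a fixed list of ids into N contiguous slices, then walk
# each slice in id order, batch by batch, keyed on the last-seen id.
ids         = (1..100).to_a   # stand-in for Product ids matching the scope
slice_count = 4               # 15 in the real launcher task
slice_size  = (ids.size / slice_count.to_f).ceil
slices      = ids.each_slice(slice_size).to_a

# One "worker": page through a slice by last-seen id, batch_size at a time.
def process_slice(slice, batch_size: 10)
  seen, last_id = [], 0
  loop do
    batch = slice.select { |id| id > last_id }.first(batch_size)
    break if batch.empty?
    seen.concat(batch)        # real code would call fp.test_response here
    last_id = batch.last      # keyset cursor: next batch starts after this id
  end
  seen
end

processed = slices.flat_map { |s| process_slice(s) }
# every id is visited exactly once, regardless of updates mid-run
```

In ActiveRecord terms, each worker would repeatedly run something like `Product.where("id > ? AND id <= ?", last_id, slice_end).order(:id).limit(batch_size)`, which stays stable even while updated_at changes.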

Christoffer