
Rails Version: 4.2.7

Ruby Version: 2.3.3

I am running a rake task to migrate several million objects from CSV and JSON files into my Postgres database.

I am using activerecord-import to speed up writing the objects to the database.

Simplifying the code as much as possible, the first half of the task dealt with object type one (which came from one of the data sources) and the second half dealt with object type two.

The first object type was built and imported like so (simplified for the question):

importing_object_one_array = []
my_external_data.each do |element|
  new_element = ObjectOne.new(
                              title: element[0],
                              body: element[1]
                             )
  importing_object_one_array << new_element
end
ObjectOne.import importing_object_one_array, validate: false

This ran on roughly 250,000 objects and wrote without any issues; I've inspected in the console and the objects were successfully written.

However, object type two has a fair few more objects, each roughly the same size and design as object type one.

There are roughly 4,040,000 of these.

How long should I wait for ObjectTwo.import to run? We're into hours now.

Alternatively, from a debugging perspective (as I would really rather not re-run this rake task unless I absolutely have to), what scripts or tactics would be useful to see whether ObjectTwo.import really is still running (even if it's taking forever) or whether the task is hanging?
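
For reference, the only tactic I can think of is that Postgres exposes currently running statements through its pg_stat_activity view, so something like the following from a second Rails console should, I assume, show whether a big INSERT is actually in flight:

# Run from a second `rails console` while the rake task is going;
# a long-running INSERT from the import should show up here.
sql = "SELECT pid, state, now() - query_start AS runtime, query " \
      "FROM pg_stat_activity WHERE state <> 'idle'"
ActiveRecord::Base.connection.select_all(sql).each { |row| puts row.inspect }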

I inspected in the Rails console and we still appear to be at the same number of ObjectTwos in the database as before.


My only other thought is that, as I didn't print anything to the console before running #import (i.e. something like puts "Now starting import!"), I don't have 100% proof that building the objects in the array has finished.
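
In hindsight, a minimal sketch of the instrumentation I mean, reusing the simplified names from my example above (with ObjectTwo swapped in), would be:

importing_object_two_array = []
my_external_data.each_with_index do |element, index|
  importing_object_two_array << ObjectTwo.new(title: element[0], body: element[1])
  # heartbeat so I can tell the build phase is still progressing
  puts "Built #{index + 1} objects..." if ((index + 1) % 100_000).zero?
end
puts "Now starting import!"
ObjectTwo.import importing_object_two_array, validate: false
puts "Import finished!"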

Nick Schwaderer
  • that's (one reason) why you would usually use something like `in_groups_of` and import a chunk at a time. – phoet Feb 26 '17 at 18:59
  • you can use `each_with_index` and print the index on each iteration to show progress. To speed it up, see [here](http://stackoverflow.com/questions/15317837/bulk-insert-records-into-active-record-table) for ways to bulk insert (you can split your records into groups for bulk insertion) – max pleaner Feb 26 '17 at 20:33
  • Use a background processing tool, like [sidekiq](https://github.com/mperham/sidekiq), so you can parallelize your import – MaicolBen Mar 04 '17 at 14:00
  • Split them in batches to be aware of progress – Nick Roz Apr 05 '18 at 15:09
  • You can also use the Parallel gem (https://github.com/grosser/parallel); it's very helpful for bulk inserting big datasets. – Alvaro Inckot May 17 '18 at 12:27

1 Answer


Since it would be very hard to guess in advance how much time your import needs (it depends on database load, indexes, and many other factors), I strongly suggest you work in batches.

Moreover, this will spare your database from facing one very big request, which could exhaust its RAM, CPU, and network resources.

So, you could do something like this:

If my_external_data comes from an ActiveRecord query:

total_records_count = large_query.count
processed_records = 0

large_query.find_in_batches(batch_size: 1000) do |batch|
  # the attribute mapping below is illustrative; adapt it to your source records
  new_objects = batch.map { |record| ObjectOne.new(title: record.title, body: record.body) }
  ObjectOne.import new_objects, validate: false
  processed_records += batch.size
  puts("Progress: #{100 * processed_records / total_records_count}%")
end

If my_external_data is a plain Ruby array:

total_records_count = large_array.size
processed_records = 0

# pass false so in_groups_of doesn't pad the last group with nils
large_array.in_groups_of(1000, false) do |batch|
  new_objects = batch.map { |element| ObjectOne.new(title: element[0], body: element[1]) }
  ObjectOne.import new_objects, validate: false
  processed_records += batch.size
  puts("Progress: #{100 * processed_records / total_records_count}%")
end

Prefer solution 1 over solution 2: by working directly with ActiveRecord, find_in_batches iterates over your data with successive primary-key-bounded, LIMIT-ed SQL queries instead of first loading all your records into RAM and then importing them.
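
To illustrate, here is a rough sketch of the queries find_in_batches issues under the hood (the table name follows Rails conventions and last_id is hypothetical):

last_id = 1_000 # the id of the last record in the previous batch

# first batch: plain ORDER BY + LIMIT
puts ObjectOne.order(:id).limit(1_000).to_sql
# => SELECT "object_ones".* FROM "object_ones" ORDER BY "object_ones"."id" ASC LIMIT 1000

# subsequent batches: a primary-key condition instead of OFFSET
puts ObjectOne.where("id > ?", last_id).order(:id).limit(1_000).to_sql
# => SELECT "object_ones".* FROM "object_ones" WHERE (id > 1000)
#    ORDER BY "object_ones"."id" ASC LIMIT 1000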

If you cannot redesign your import this way, expect the run time to be roughly linear, provided your database has comfortably more resources than a single batch requires; otherwise, the run time will be quite unpredictable.
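
As a side note, reasonably recent versions of activerecord-import also accept a batch_size option, which splits the generated multi-row INSERT statements; assuming your installed version supports it, it combines nicely with the loops above:

# let the gem chunk the INSERTs as well (availability depends on your gem version)
ObjectOne.import new_objects, validate: false, batch_size: 1_000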

Alexis Clarembeau