How to index tons of data at once with Rails, (re)tire and JSON without eating (all) memory?

Question

In a Rails 3.2.x app (Ruby 1.9.3) that uses (Re)tire to access an ES cluster, a rake task goes through approximately 1M rows to create a new index.

The task uses .to_json with specific attributes and methods listed to limit the resulting hash for each element. Yet as the task runs, memory is eaten away, and the process usually ends up being killed by the system.

The task is already using find_in_batches. Smaller batch sizes (using find_each) don't help.

checking without index

Removing the index.import call does improve things (obviously): the task goes through the whole collection very fast without a problem. This points to either ES, Tire or the JSON conversion (and the relations it might call upon).

reducing the scope of the task

Adding back index.import and passing a very limited hash (with string keys) for each item makes things slower, but not by much, and does not eat memory away. So JSON might not be the culprit here.

adding attributes and methods back

The culprit seems to be one of the methods used to grab one of the additional attributes. It's based on a relation of the model and another ... ending up with a lot of models being involved and sifted through.

As pointed out in "Index the results of a method in ElasticSearch (Tire + ActiveRecord)", adding includes does help a bit, but the task still ends up heavy.

going around

I also tried to work around part of the problem by replacing the calls to Tire with the ES bulk API. Generating JSON files and sending them with a Ruby HTTP library can work. Yet the same memory problem arises, since the same requests to the DB are made.
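For reference, a minimal sketch of what building a bulk API body from plain hashes might look like. The `bulk_body` helper and the `articles` / `article` names are made up for illustration, and the `_type` field assumes an older ES version that still uses mapping types. It only builds the newline-delimited payload; sending it over HTTP is left out.

```ruby
require 'json'

# Build a bulk-API request body (newline-delimited JSON) from plain hashes.
# index_name and type_name are placeholders, not values from the question.
def bulk_body(docs, index_name, type_name)
  lines = docs.flat_map do |doc|
    # Action line: tells ES to index the document that follows it.
    action = {
      'index' => {
        '_index' => index_name,
        '_type'  => type_name, # assumption: an ES version that still has types
        '_id'    => doc.fetch('id')
      }
    }
    [JSON.generate(action), JSON.generate(doc)]
  end
  # Every line of a bulk body, including the last one, must end with "\n".
  lines.join("\n") + "\n"
end

docs = [
  { 'id' => 1, 'title' => 'first' },
  { 'id' => 2, 'title' => 'second' }
]
body = bulk_body(docs, 'articles', 'article')
puts body
```

Working from plain hashes like this keeps the serialization step cheap, but it does not change how much memory the DB queries behind those hashes use.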

What's left?

What I don't get is why, even with find_in_batches, Ruby keeps eating away memory. I would expect that after each batch of data, the memory related to that batch would be freed.

Next to try: GC.start calls, and deactivating Active Record caching around the task.

Yet, unless a solution limits the memory use drastically (300 or 500 MB instead of 800+), the underlying issue remains: indexing a lot of instances of a model, including data related to some other models.
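A rough sketch of the GC.start idea, in plain Ruby so it runs without Rails. The `each_batch_with_gc` helper is hypothetical, and the range stands in for the real records. In the rake task the `each_slice` call would presumably become `find_in_batches`, and the loop could be wrapped in ActiveRecord's `uncached` block. Whether either actually lowers peak memory here is untested.

```ruby
# Process any enumerable in fixed-size batches, forcing a GC run after each.
# Returns the number of batches processed.
def each_batch_with_gc(records, batch_size)
  batches = 0
  records.each_slice(batch_size) do |batch|
    yield batch
    GC.start # ask Ruby to collect garbage now instead of waiting
    batches += 1
  end
  batches
end

indexed = 0
count = each_batch_with_gc(1..10_000, 1_000) do |batch|
  # Build small string-keyed hashes, as in the "limited hash" experiment.
  docs = batch.map { |i| { 'id' => i, 'title' => "item #{i}" } }
  indexed += docs.size # stand-in for index.import(docs)
end
puts "#{count} batches, #{indexed} documents"
```

Note that GC.start can only reclaim objects that are no longer referenced, so anything still held by an outer variable or a cache survives the sweep.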

Am I missing something for the import and includes that would solve the issue?
Would splitting that task into smaller background jobs (Resque, Sidekiq) help? I would suppose so, as each batch would be isolated from the others and, once processed, would really free up its memory (?). Orchestrating those jobs would be another problem.
Are there good practices for indexing large quantities of data into ES?

jlecour · Answer 1 · 2014-05-05T11:58:20.447

I've been using Rails + Elasticsearch for a while and did this kind of dance a few times. A few things come to mind, in no particular order.

Did you try the recent elasticsearch gem (instead of Tire)? I've updated my apps to use it and like having more control over what is done.
I would also try to force a GC sweep after each ActiveRecord loop. You could also be extra careful with memory allocation by explicitly resetting all local variables each time.
You could use the fork & exec trick to fork a brand new process at each loop; it would be the most effective GC you can get. It's a little overhead when you write it the first time, but the pay-off is great. Take good care to limit the amount of memory used in the outer part of the task. Using a process-based background task would partly achieve the same goal, but you might still get memory bloat.
Can you limit the use of ActiveRecord? If you need some basic associations, you could use a lower-level/simpler tool like Sequel (or something similar) to work with Ruby hashes/arrays instead of full-fledged AR models.
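To illustrate the process-per-loop suggestion, here is a minimal sketch that uses plain fork (no exec) to run each batch in a short-lived child process. The `run_in_child` helper is hypothetical, and the range stands in for real records. In a Rails task the child would typically need its own DB connection (for example by calling `ActiveRecord::Base.establish_connection`), since connections should not be shared across a fork. It requires a platform that supports fork.

```ruby
# Run a block in a child process and wait for it to finish.
# All memory allocated by the block is returned to the OS when the child exits.
# Returns true if the child exited successfully.
def run_in_child
  pid = Process.fork do
    yield
    exit!(0) # leave immediately, skipping at_exit handlers inherited from the parent
  end
  _, status = Process.wait2(pid)
  status.success?
end

# The parent only holds lightweight values (here, integer ids) per batch.
results = (1..5_000).each_slice(1_000).map do |batch|
  run_in_child do
    docs = batch.map { |i| { 'id' => i } }
    # stand-in for loading the records and calling index.import(docs)
    raise 'empty batch' if docs.empty?
  end
end
puts results.inspect
```

The parent stays small because it never loads the records itself; it only hands each child a range of ids and checks the exit status.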

1. Migrating to the new gem is in the works, but as described, the ES part does not appear to be relevant to the (memory) problem. 2 & 3: will try them out, thanks! — Thomas R., May 05 '14 at 11:55
You can find a lot of valuable information about fork & exec in Jesse Storimer's book, "Working With Unix Processes" : http://www.jstorimer.com/products/working-with-unix-processes — jlecour, May 05 '14 at 11:58