
I have Ruby code that more or less looks like this:

limit = 10_000
offset = 0
index = 1

User.establish_connection(..) # db1

class Member < ActiveRecord::Base
  self.table_name = 'users'
end

Member.establish_connection(..) # db2

while true
  users = User.limit(limit).offset(offset).as_json # from database 1
  offset = limit * index
  index += 1
  users.each do |u|
    member = Member.find_by(name: u["name"]) # as_json returns string keys
    if member.nil?
      Member.create(u)
    elsif member.updated_at < u["updated_at"]
      member.update_attributes(u)
    end
  end
  break if break_condition
end

What I'm seeing is that the RSS memory (htop) keeps growing, and at one point it reaches 10GB. I'm not sure why this is happening, but the memory never seems to be released by Ruby back to the OS.

I'm aware there is a long list of questions in line with this one. I have even tried changing my code to look like the following (the last three lines specifically), i.e. running GC.start manually, but the result is still the same.

while true
  # ...
  users = nil
  GC.start
  break if break_condition
end

Tested this on Ruby versions 2.2.2 and 2.3.0.

EDIT: Other details

1) OS.

DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=15.04
DISTRIB_CODENAME=vivid
DISTRIB_DESCRIPTION="Ubuntu 15.04"

2) Ruby installed and compiled via RVM.

3) ActiveRecord version 4.2.6

Viren
  • 1
    `when`? Do you mean `while`? – matt Apr 21 '16 at 13:28
  • 1
    `more or less look like this` maybe it is better to show exact code? – fl00r Apr 21 '16 at 13:32
  • @fl00r It is the exact code, except the class or model names are changed – Viren Apr 21 '16 at 13:57
  • @matt yup, that's a typo – Viren Apr 21 '16 at 13:57
  • @fl00r also I have removed some `puts` calls for nicety – Viren Apr 21 '16 at 13:58
  • you can investigate using `GC.stat` or some profiler – niceman Apr 21 '16 at 14:11
  • try `member=nil` at the end of do {..} clause inside `users.each` – niceman Apr 21 '16 at 14:18
  • Could it be possible that there are a lot of members with the same name? – fl00r Apr 21 '16 at 14:25
  • @fl00r Nope, there is a unique constraint on name. What I can say is that the records are huge, but I don't expect it to eat up 10GB. On the first run (of the while loop) the memory stays at 3234MB (according to htop); on the next run it spikes again, and it keeps going like this until no more memory is left. – Viren Apr 21 '16 at 14:33
  • 1
    @niceman Not much time to test profiling stuff, running in production mode. If I do get time I will share the result. – Viren Apr 21 '16 at 14:39
  • @fl00r Also, I think `Member.find_by()` returns a single entity, i.e. `LIMIT 1`, I believe – Viren Apr 21 '16 at 14:41
  • I had this exact issue in the past using MongoDB. Spent a lot of time trying to get the memory to clean up, but Ruby just hangs on to it until the machine is dying. – Mike S Apr 21 '16 at 15:12
  • What db engine, and how do you run the code? Setting aside the different data and the unclear value of `limit`, I did test equivalent code on one of our larger Rails 4.2 app projects with a PostgreSQL db and it does *not* leak. Ruby 2.2 & 2.3.x – joanbm Apr 21 '16 at 16:41
  • PostgreSQL 9.4, and yes, ActiveRecord 4.2. The above `user` record has more info in it. – Viren Apr 21 '16 at 16:50
  • @joanbm What I can say is that the records are huge, but I don't expect it to eat up 10GB. On the first run (of the while loop) the memory stays around 3234MB (according to htop); on the next run it spikes, and it keeps going like this until no more memory is left – Viren Apr 21 '16 at 16:54
  • @Viren It's nigh impossible to determine the cause from such a fragment. It may be related to code run indirectly, like AR callbacks bound to the `Member` model. I've run similar code in my app, and after a short warmup RSS topped out at ~320 MiB, not exceeded after thousands of iterations. Table contains tens of thousands of records, 14 columns. – joanbm Apr 21 '16 at 17:33

1 Answer


I can't tell you the source of the memory leak, but I do spy some low-hanging fruit.

But first, two things:

  1. Are you sure that ActiveRecord is the right way to copy data from one database to another? I'm very confident that it's not. Every major database product has robust export and import capabilities, and the performance you'll see there will be many, many times better than doing it in Ruby, and you can always invoke those tools from within your app. Think hard about that before you continue down this path.

  2. Where does the number 10,000 come from? Your code suggests that you know it's not a good idea to fetch all of the records at once, but 10,000 is still a lot of records. You may see some gains by simply trying different numbers: 100 or 1,000, say.
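To make point 1 concrete, here is a minimal sketch of shelling out to the Postgres tools instead of moving rows through ActiveRecord. The database names (`db1`, `db2`), the table name, and the flags are placeholders for illustration, not your actual setup:

```ruby
# Hypothetical sketch: dump the users table from one database and pipe it
# straight into the other, letting Postgres do the heavy lifting.
dump_cmd = "pg_dump --data-only --table=users db1"
load_cmd = "psql db2"

copy_command = "#{dump_cmd} | #{load_cmd}"
# system(copy_command)  # uncomment to actually run it from your app
```

This moves the data without ever allocating a Ruby object per row, which is exactly the work your while loop is paying for.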

That said, let's dig into what this line is doing:

users = User.limit(10000).offset(offset).as_json

The first part, User.limit(10000).offset(offset), creates an ActiveRecord::Relation object representing your query. When you call as_json on it, the query is executed, which instantiates 10,000 User model objects and puts them in an array, and then a Hash is constructed from each of those User objects' attributes. (Take a look at the source for ActiveRecord::Relation#as_json.)

In other words, you're instantiating 10,000 User objects only to throw them away after you've got their attributes.

So, a quick win is to skip that part entirely. Just select the raw data:

user_keys = User.attribute_names

until break_condition
  # ...
  users_values = User.limit(10000).offset(offset).pluck(*user_keys)

  users_values.each do |vals|
    user_attrs = user_keys.zip(vals).to_h
    member = Member.find_by(name: user_attrs["name"])
    if member.nil?
      Member.create(user_attrs)
    elsif member.updated_at < user_attrs["updated_at"]
      member.update_attributes(user_attrs)
    end
  end
end

ActiveRecord::Calculations#pluck returns an array of arrays with the values from each record. Inside the users_values.each loop we turn each values array into a Hash. No need to instantiate any User objects.
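The key-to-value pairing is plain Ruby, so it's easy to see in isolation (the sample keys and values below are made up):

```ruby
# One row as pluck would return it, paired with the column names:
user_keys = ["id", "name", "updated_at"]
vals      = [42, "alice", "2016-04-21 13:28:00"]

user_attrs = user_keys.zip(vals).to_h
# => {"id" => 42, "name" => "alice", "updated_at" => "2016-04-21 13:28:00"}
```

zip pairs each key with the value at the same index, and to_h turns the array of pairs into a Hash, so you end up with the same string-keyed attributes Hash that as_json would have given you, minus the model object.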

Now let's take a look at this:

member = Member.find_by(name: user_attrs["name"])
member.update_attributes(user_attrs)

This selects a record from the database, instantiates a Member object, and then updates the record in the database—10,000 times in every iteration of the while loop. This is the correct approach if you need validations to run when that record is updated. If you don't need validations to run, though, you can save time and memory by, again, not instantiating any objects:

Member.where(name: user_attrs["name"]).update_all(user_attrs)

The difference is that ActiveRecord::Relation#update_all doesn't select the record from the database or instantiate a Member object, it just updates it. You said in your comment above that you have a unique constraint on the name column, so we know that this will update only a single record.
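Roughly, the two approaches differ like this at the SQL level. This is illustrative SQL written by hand, not literal ActiveRecord log output, and the exact quoting varies by adapter:

```ruby
# find_by + update_attributes: two round trips, plus a Member instance in memory.
find_and_update = [
  %(SELECT "members".* FROM "members" WHERE "members"."name" = 'alice' LIMIT 1),
  %(UPDATE "members" SET ... WHERE "members"."id" = 1),
]

# where(...).update_all: a single UPDATE, no SELECT, no model object.
update_all = [
  %(UPDATE "members" SET ... WHERE "members"."name" = 'alice'),
]
```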

Having made those changes, you must still contend with the fact that you have to do 10,000 UPDATE queries in each iteration of the while loop. Again, consider using your databases' built-in export and import functionality instead of trying to make Rails do this.

Jordan Running
  • Thanks for the answer. Apologies, copying to a different database is not that straightforward, hence I can't use `pg_dump` and `pg_restore`. – Viren Apr 21 '16 at 17:03
  • I have updated the code to show how the copy-over works. – Viren Apr 21 '16 at 17:03
  • Still, there are better ways to do this. You're basically doing an [upsert](http://stackoverflow.com/questions/17267417/how-to-upsert-merge-insert-on-duplicate-update-in-postgresql) with a simple condition on `updated_at`. If the data were in two separate tables in the same database, you could do a JOIN with the same condition to get the rows to be upserted. Since they're not in the same database you could either export-and-import into a table with a different name or use [postgres_fdw](http://www.postgresql.org/docs/9.3/static/postgres-fdw.html) to connect directly to the other database. – Jordan Running Apr 21 '16 at 17:18