
I've got a small Ruby script that pores over 80,000 or so records.
The processor and memory load involved for each record is smaller than a smurf's, but it still takes about 8 minutes to walk all the records.

I'd thought to use threading, but when I gave it a go, my DB ran out of connections. Sure, that was when I attempted to connect 200 times, and really I could limit it better than that. But when I push this code up to Heroku (where I have 20 connections for all workers to share), I don't want to risk blocking other processes because this one ramped up.

I have thought of refactoring the code so that it combines all the SQL, but that is going to feel really messy.

So I'm wondering: is there a trick to letting the threads share connections? Given that I don't expect the connection variable to change during processing, I'm actually sort of surprised that spawning a thread requires creating a new DB connection.

Any help would be super cool (just like me). Thanks!


SUPER CONTRIVED EXAMPLE
Below is a 100% contrived example. It does display the issue.
I am using ActiveRecord inside a very simple thread. It seems each thread is creating its own connection to the database. I base that assumption on the warning message that follows.
START_TIME = Time.now

require 'rubygems'
require 'erb'
require 'yaml'
require 'active_record'

@environment = 'development'
@dbconfig = YAML.load(ERB.new(File.read('config/database.yml')).result)
ActiveRecord::Base.establish_connection @dbconfig[@environment]

class Product < ActiveRecord::Base; end

ids = Product.pluck(:id)
p "after pluck #{Time.now.to_f - START_TIME.to_f}"

threads = []
ids.each do |id|
  # one thread (and, it turns out, one DB connection) per record
  threads << Thread.new { Product.where(:id => id).update_all(:product_status_id => 99) }
  if threads.size > 4
    threads.each(&:join)
    threads = []
    p "after thread join #{Time.now.to_f - START_TIME.to_f}"
  end
end
threads.each(&:join) # don't let a final batch of fewer than 5 threads get killed at exit

p "#{Time.now.to_f - START_TIME.to_f}"

OUTPUT

"after pluck 0.6663269996643066"
DEPRECATION WARNING: Database connections will not be closed automatically, please close your
database connection at the end of the thread by calling `close` on your
connection.  For example: ActiveRecord::Base.connection.close
. (called from mon_synchronize at /Users/davidrawk/.rvm/rubies/ruby-1.9.3-p448/lib/ruby/1.9.1/monitor.rb:211)
.....
"after thread join 5.7263710498809814"   #THIS HAPPENS AFTER THE FIRST JOIN.
.....
"after thread join 10.743254899978638"   #THIS HAPPENS AFTER THE SECOND JOIN
baash05
  • `I have thought of refactoring the code` - I suspect you actually will have to - 80,000 DB accesses, as you imply, is going to be slow whatever you do. Can you show some code? – Ken Y-N Dec 03 '13 at 07:25
  • If you're using ActiveRecord then it's already shared. – pguardiario Dec 03 '13 at 07:38
  • Slow is all about perspective. :) 80,000 in 8 minutes would be fine if I wasn't building the system to handle a couple million. Sadly, there's nothing special about the code. Just a small script to connect and update records. – baash05 Dec 03 '13 at 21:55

1 Answer


A connection pool might be what you need; see the connection_pool gem (https://github.com/mperham/connection_pool) and this answer: Why not use shared ActiveRecord connections for Rspec + Selenium?
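
For what it's worth, here is a minimal sketch of the connection_pool gem's checkout pattern (mine, not part of the original answer); SomeDbClient is a hypothetical stand-in for whatever client object you want to share, and the size/timeout values are just examples:

require 'connection_pool'

# SomeDbClient is a made-up placeholder for the resource being shared.
DB_POOL = ConnectionPool.new(:size => 5, :timeout => 5) do
  SomeDbClient.new
end

# Each thread checks a client out for the duration of the block and then
# returns it, so at most :size connections are ever open.
DB_POOL.with do |client|
  client.do_work
end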

The other option would be to use EventMachine (https://github.com/eventmachine/eventmachine) and run your tasks in an EM.defer block, in such a way that DB access happens in the callback block (within the reactor) in a non-blocking way.
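
To show the shape of that API, here is a rough sketch of EM.defer (mine, with some_id as a made-up placeholder for a record id). Note that it puts the blocking DB call in the operation proc, which runs on EM's internal thread pool; doing the DB work inside the reactor itself, as described above, would additionally need a non-blocking driver:

require 'eventmachine'

EM.run do
  operation = proc do
    # runs on EM's internal thread pool, so a blocking DB call can live here
    Product.where(:id => some_id).update_all(:product_status_id => 99)
  end
  callback = proc do |result|
    # runs back on the reactor thread once the operation finishes
    EM.stop
  end
  EM.defer(operation, callback)
end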

Alternatively, and a more robust solution too, go for a lightweight background processing queue such as beanstalkd; see https://www.ruby-toolbox.com/categories/Background_Jobs for more options. This would be my primary recommendation.
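
The answer recommends beanstalkd; since the question's author mentions in the comments that the script already runs inside Delayed::Job, here is a sketch of the same enqueue-a-batch idea using Delayed::Job's custom-job API instead (UpdateBatchJob is a class invented for illustration, and the batch size of 1,000 is arbitrary):

# Hypothetical job class: each enqueued job updates one slice of ids,
# and the Delayed::Job workers drain the queue at their own pace.
class UpdateBatchJob < Struct.new(:ids)
  def perform
    Product.where(:id => ids).update_all(:product_status_id => 99)
  end
end

Product.pluck(:id).each_slice(1000) do |slice|
  Delayed::Job.enqueue UpdateBatchJob.new(slice)
end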

EDIT:

Also, you probably don't have 200 cores, so creating 200+ parallel threads and DB connections doesn't really speed up the process (it actually slows it down). See if you can find a way to partition your problem into a number of sets equal to your number of cores + 1 and solve the problem that way.

This is probably the simplest solution to your problem.
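
A rough sketch of that partitioning idea (mine, not part of the original answer), assuming 4 cores and therefore 5 worker threads. Each thread borrows one connection from ActiveRecord's built-in pool, so the pool: size in database.yml has to cover the workers plus the connection the main thread already holds:

WORKERS = 5 # cores + 1
slices  = ids.each_slice((ids.size.to_f / WORKERS).ceil).to_a

threads = slices.map do |slice|
  Thread.new do
    # borrow one pooled connection for the whole slice, then return it
    ActiveRecord::Base.connection_pool.with_connection do
      Product.where(:id => slice).update_all(:product_status_id => 99)
    end
  end
end
threads.each(&:join)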

bbozo
  • Thanks for the connection pool link, that might be the ticket. My 300-line Ruby app is already running in a Delayed_Job, which is sort of why I was looking for something. It sucks to have 250 megs of RAM and use 3 megs, but at the same time have 10 minutes to get a job done and take 8 (or more). I want to trade RAM for speed, and "cores + 1" threads seems to be one way to accomplish this. I couldn't even have two threads before your answer, because connections were my limit. Again, thanks! – baash05 Dec 03 '13 at 21:57
  • I gave the second link a go, and my simple little app crashed each time it attempted to connect in the thread. – baash05 Dec 04 '13 at 00:28
  • EventMachine fires up distinct connections to the DB. :( – baash05 Dec 04 '13 at 00:31
  • EM.defer uses a thread pool to execute its code (default is 10 threads iirc), which means that no more than number-of-threads connections will be up and your CPUs should still be pegged at 100% given an appropriate thread count for your system. – bbozo Dec 04 '13 at 10:56
  • The other thing EM is good at is getting the maximum out of your CPU, which means that if you make sure all your IO calls are non-blocking, your app will not wait for DB responses and other socket operations, which lets your processors be used to the max. This means that if you have x cores, you can set up x EM reactors, and assuming your reactor is well written you should be able to squeeze the last bit of processing power from your system. – bbozo Dec 04 '13 at 11:02
  • @baash05 Also, the trade-off between computing power and memory isn't as obvious as you describe it. You will get the best performance when you limit the number of worker threads so your CPUs are pegged at 100%. Once you start spawning more than this you will actually *degrade* performance due to context-switching overhead between threads. However, this entire discussion applies to JRuby and rbx, and **not to MRI Ruby**; if you're using MRI then you need to fork processes (not threads), or use EM, because the GIL puts a hard limit on your processing that thread spam won't fix (a fork-based sketch follows these comments). – bbozo Dec 04 '13 at 11:08
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/42494/discussion-between-dosadnizub-and-baash05) – bbozo Dec 04 '13 at 12:24
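
Following up on that last point about MRI, here is a rough sketch of the fork-processes-instead-of-threads idea (mine, not bbozo's), again assuming 4 slices; each child opens its own connection because a connection can't safely be shared across a fork:

slices = ids.each_slice((ids.size.to_f / 4).ceil).to_a

pids = slices.map do |slice|
  fork do
    # a forked child can't reuse the parent's connection, so reconnect
    ActiveRecord::Base.establish_connection(@dbconfig[@environment])
    Product.where(:id => slice).update_all(:product_status_id => 99)
  end
end
pids.each { |pid| Process.wait(pid) } # wait for every child to finish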