
After reading all the posts about parallelism in Ruby, I only got more confused, so I will describe what I want to do.

I have an array names that contains around 1000 names.

names
=> [{"name"=>"tickets"}, {"name"=>"events"}, {"name"=>"channel"}, {"name"=>"primes"}]

For each name I want to drop the table if it exists (and recreate it) using the pg gem.

drop_str = "DROP TABLE IF EXISTS %s ;"
create_str = "CREATE TABLE %s (id SERIAL PRIMARY KEY,bkk varchar(255))"

names.each do |name|
    conn.exec((drop_str % name["name"]) + (create_str % name["name"]))
end

But I do not want to drop the tables one after another; I want to do it in parallel.

My idea is to use the following:

threads = []
drop_str = "DROP TABLE IF EXISTS %s ;"
create_str = "CREATE TABLE %s (id SERIAL PRIMARY KEY,bkk varchar(255))"

names.each do |name|
    threads.push(Thread.new{conn.exec((drop_str % name["name"]) + (create_str % name["name"]))})
end

and then to join the threads.
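
That is, roughly, using the threads array built above:

threads.each(&:join)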

In reality, will the tables be dropped in parallel, or still one after another?

the Tin Man
Alina
  • Why would you wish to drop them in parallel? If you could explain the reason behind that, it would be helpful. – Abdullah Jan 05 '17 at 15:40
  • @MuhammadAbdullah To save time. If I do it in parallel it will be faster than if I do it one after another. (In fact, I have 1K tables; I need to preprocess each table and add records independently for each table. If I do it in a row, it will be slow. If I do it in parallel for each table, it will be faster.) – Alina Jan 05 '17 at 15:44
  • Usually, Sidekiq or JRuby are used for parallel tasks. Read more about Ruby's GIL (especially MRI's GIL). – Ilya Jan 05 '17 at 15:54
  • @Ilya what about the gem Parallel? – Alina Jan 05 '17 at 15:58
  • At first glance, it seems like it could work too. But, actually, I didn't try it. – Ilya Jan 05 '17 at 16:09
  • @Ilya the GIL does not restrict IO-bound operations in any way. For most things web apps do (including this problem at hand here), MRI works just fine since most operations are not CPU-bound but IO-bound. – Holger Just Jan 05 '17 at 16:15
  • @HolgerJust so, the way I have it, it will work, won't it? – Alina Jan 05 '17 at 16:23
  • @HolgerJust, threads belonging to the same Ruby process can never be executed in parallel in MRI because of the GIL, can they? – Ilya Jan 05 '17 at 16:26
  • @Ilya I am a bit confused because [here](http://stackoverflow.com/questions/7561629/how-to-process-items-in-an-array-in-parallel-using-ruby-and-open-uri) derp gives an example corresponding to what I have – Alina Jan 05 '17 at 16:30
  • @Tonja, maybe http://www.csinaction.com/2014/10/10/multithreading-in-the-mri-ruby-interpreter/ will be useful for you. – Ilya Jan 05 '17 at 16:37
  • you can easily benchmark the result. But I'd choose a smaller number of threads to start out with. – max pleaner Jan 05 '17 at 16:38
  • @Ilya The problem is that when I run my code, it always gives me different errors, and I think it is because my database (Redshift) always wants to send me information about whether the table was dropped, etc. – Alina Jan 05 '17 at 17:00
  • @Ilya Without distracting from the original question too much, if threads are blocked waiting for some IO request to return (e.g. during filesystem access, network io, ...) Ruby schedules other threads. This is similar to when you have only a single CPU core but can still execute multiple concurrent programs. – Holger Just Jan 05 '17 at 17:33
  • Why drop the table then immediately recreate it? Just truncate it. Also, instead of using the pg gem, I'd strongly recommend using an ORM like Sequel or ActiveRecord. They're plenty fast and abstract the code from the DBM allowing easy porting. Also, if you need to process in parallel, it seems like you could wrap everything in a transaction, execute your commands, then commit the transaction which'd force the DBM to do everything in parallel, or at least to the best of its ability. – the Tin Man Jan 05 '17 at 18:30
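
For illustration, a minimal sketch of the truncate-inside-a-transaction idea from the last comment, using the Sequel gem (the connection URL is a placeholder and names is the array from the question):

require 'sequel'

# Connection URL is a placeholder; adjust for your database.
db = Sequel.connect('postgres://user:password@localhost/mydb')

# Truncate each table inside a single transaction instead of
# dropping and recreating it, as suggested in the comment above.
db.transaction do
  names.each do |name|
    db.run("TRUNCATE TABLE %s" % name["name"])
  end
end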

1 Answer


In principle, you can run multiple SQL statements in parallel. Most database engines are multi-threaded and can execute multiple statements at once. Sometimes it doesn't make much sense though, e.g. when using SQLite.

There are several caveats though which will probably break your current code.

Most importantly, a single connection to your database always has some state attached to it: it often holds open transactions and internal state of the database adapter. As such, a single database connection is generally only usable by one thread at a time. If you attempt to send multiple parallel statements over a single connection, things will probably break in rather non-deterministic ways.

Thus, when trying to run multiple statements in parallel using threads, each thread needs its own database connection. Here, it often makes sense to use a thread pool which holds a fixed maximum number of connections and schedules work from a queue onto them.
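
For example, a minimal per-thread-connection sketch of your snippet might look like this (the PG.connect parameters are placeholders you would need to adapt; note that this still opens one connection per name, which is why a pool is preferable for ~1000 tables):

require 'pg'

drop_str   = "DROP TABLE IF EXISTS %s ;"
create_str = "CREATE TABLE %s (id SERIAL PRIMARY KEY, bkk varchar(255))"

threads = names.map do |name|
  Thread.new do
    # Each thread opens its own connection so that no connection
    # is ever shared between threads.
    conn = PG.connect(dbname: 'mydb') # connection params are placeholders
    begin
      conn.exec((drop_str % name["name"]) + (create_str % name["name"]))
    ensure
      conn.close
    end
  end
end

threads.each(&:join)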

You could use e.g. Rails' ConnectionPool to handle the database connections and schedule your statements using one of the ThreadPool implementations from the excellent concurrent-ruby gem.
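
A rough sketch of that approach, here using the standalone connection_pool gem as a simpler stand-in for ActiveRecord's ConnectionPool, together with a Concurrent::FixedThreadPool from concurrent-ruby (the pool sizes and connection parameters are assumptions):

require 'pg'
require 'concurrent'
require 'connection_pool'

# A small pool of database connections shared by the worker threads.
db_pool = ConnectionPool.new(size: 10, timeout: 30) do
  PG.connect(dbname: 'mydb') # connection params are placeholders
end

# A fixed number of worker threads; work is queued with #post.
executor = Concurrent::FixedThreadPool.new(10)

names.each do |name|
  executor.post do
    db_pool.with do |conn|
      conn.exec("DROP TABLE IF EXISTS %s ;" % name["name"])
      conn.exec("CREATE TABLE %s (id SERIAL PRIMARY KEY, bkk varchar(255))" % name["name"])
    end
  end
end

# Wait for all queued work to finish.
executor.shutdown
executor.wait_for_termination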

Holger Just
  • ConnectionPool is part of ActiveRecord, which doesn't necessary mean using Rails. There's no reason to pull in the Rails stack just for that. An excellent alternative to ActiveRecord is Sequel. – the Tin Man Jan 05 '17 at 18:25