Ruby Parallel/Multithread Programming to read huge database

Question

I have a Ruby script that reads a huge table (~20m rows), does some processing and feeds the results to Solr for indexing. This has been a big bottleneck in our process. I am planning to speed things up and would like to achieve some kind of parallelism, but I am confused about the nature of Ruby's multithreading. Our servers have ruby 1.8.7 (2009-06-12 patchlevel 174) [x86_64-linux]. From this blog post and this question at StackOverflow, it appears that Ruby does not have "real" multithreading. Our servers have multiple cores, so using the parallel gem seems like another option.

Which approach should I go with? Any input on systems that read from a database in parallel and feed an indexer would also be highly appreciated.

I use the mysql gem to fetch N (~500) records at a time using MySQL's LIMIT and OFFSET clauses, then batch process them and batch feed them to Solr. Is more info needed? – pr4n Sep 28 '11 at 09:09

score 4 · Answer 1 · answered Oct 06 '11 at 08:01

You can parallelize this at the OS level. Change the script so that it can take a range of lines from your input file:

$ reader_script --lines=10000:20000 mytable.txt
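A minimal sketch of how the script might read such a range, using Ruby's standard optparse library. Since the question reads from MySQL rather than a text file, the sketch also turns the range into LIMIT/OFFSET batches of the kind described in the question's comment. The table name `mytable`, the `id` column, the batch size and both helper names are hypothetical.

```ruby
require 'optparse'

# Parse "--lines=START:STOP" out of argv and return [start, stop].
# The range is treated as half-open: start is included, stop is not.
def parse_line_range(argv)
  range = nil
  OptionParser.new do |opts|
    opts.on('--lines=RANGE', 'Rows to process, as START:STOP') do |value|
      unless value =~ /\A(\d+):(\d+)\z/
        raise OptionParser::InvalidArgument, value
      end
      range = [$1.to_i, $2.to_i]
    end
  end.parse!(argv) # removes the parsed options, leaving e.g. the file name
  raise ArgumentError, 'missing --lines=START:STOP' if range.nil?
  range
end

# Yield one SQL string per batch covering rows start...stop.
# Ordering by a unique key (assumed here to be `id`) keeps OFFSET-based
# paging stable from one query to the next.
def each_batch_query(start, stop, batch_size = 500)
  offset = start
  while offset < stop
    limit = [batch_size, stop - offset].min
    yield "SELECT * FROM mytable ORDER BY id LIMIT #{limit} OFFSET #{offset}"
    offset += limit
  end
end
```

Each instance would then run the yielded queries through its own database connection and feed the rows to Solr, so no state is shared between instances.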

Then execute multiple instances of the script.

$ reader_script --lines=0:10000 mytable.txt &
$ reader_script --lines=10000:20000 mytable.txt &
$ reader_script --lines=20000:30000 mytable.txt &

The operating system's scheduler will distribute the processes across the available cores automatically.
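The same idea can be driven from a single Ruby launcher instead of the shell. The following is only a sketch under assumptions: it relies on `fork`, which is available on Unix-like systems, and the helper names `split_ranges` and `run_in_processes` are made up for illustration.

```ruby
# Split rows 0...total into `workers` contiguous half-open ranges
# [start, stop), spreading any remainder over the first ranges.
def split_ranges(total, workers)
  base, extra = total.divmod(workers)
  start = 0
  (0...workers).map do |i|
    size = base + (i < extra ? 1 : 0)
    range = [start, start + size]
    start += size
    range
  end
end

# Fork one child process per range, run the given block in each child,
# then wait for all children. Returns one true/false per child, telling
# whether it exited successfully.
def run_in_processes(ranges)
  pids = ranges.map do |start, stop|
    fork do
      yield start, stop
    end
  end
  pids.map { |pid| Process.wait2(pid)[1].success? }
end
```

For example, `run_in_processes(split_ranges(30000, 3)) { |start, stop| ... }` would process the same three ranges as the shell commands above. Each child should open its own database connection inside the block rather than inherit one from the parent.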

Matthias Berth

This seems like a reasonable approach. We have 8 cores, so I can run up to 8 instances very easily. – pr4n Oct 07 '11 at 07:35

Jonas Elfström · Answer 2 · 2011-09-28T08:32:15.983

1

Any chance of upgrading to Ruby 1.9? It's usually faster than 1.8.7.

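It's true that Ruby suffers from having a GIL (strictly speaking, 1.8 uses green threads and the GIL arrived in 1.9; either way, only one thread runs Ruby code at a time), but if multithreading would solve your problem, you can take a look at JRuby, since it supports true threading.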

Also, you'd better make sure that the CPU is the bottleneck, because if it's I/O, multithreading might not buy you much.
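One rough way to check this, sketched with Ruby's standard benchmark library, is to compare the CPU time a batch consumes with the wall-clock time it takes. The helper names and the reading of the ratio are assumptions for illustration, not a precise profiling method.

```ruby
require 'benchmark'

# Fraction of wall-clock time that the process spent on the CPU.
# Near 1.0 suggests CPU-bound work; near 0.0 suggests the process was
# mostly waiting, for example on the database or on Solr.
def cpu_fraction(tms)
  return 0.0 if tms.real <= 0
  tms.total / tms.real
end

# Measure the given block and return its CPU fraction.
def measure_cpu_fraction
  cpu_fraction(Benchmark.measure { yield })
end
```

Wrapping the processing of a single batch in `measure_cpu_fraction { ... }` gives a first hint. Time spent inside the MySQL or Solr servers does not count as CPU time of the Ruby process, so a low value points at I/O or at the remote side.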

edited Sep 28 '11 at 08:32

answered Sep 28 '11 at 08:27

Jonas Elfström

