I ran into a problem when I started running my scrapers in threads. I have 3 services that scrape data from web pages, and I want to run each of them in its own thread to see how they work together. In the future I also want to add more scrapers.

parser_controller.rb

def call_all_parsers
    file = File.read('app/controllers/matches.json')
    data = JSON.parse(file)

    threads = []
    data.each_key do |office|
        data[office].each_key do |link|                
            if office == 'first_office'
               p threads << Thread.new { Services::Scrapers::FirstScraperService.new.parse(link, data[office][link]) }
            elsif office == 'second_office'
               p threads << Thread.new { Services::Scrapers::SecondScraperService.new.parse(link, data[office][link]) }
            elsif office == 'third_office'
               p threads << Thread.new { Services::Scrapers::ThirdScraperService.new.parse(link, data[office][link]) } 
            end
        end
    end
    p threads.map(&:join)
    render 'calculate_arbitration/index'
end

When I called the call_all_parsers method, it hung. How should I do this, or could you advise something else to use instead of threads?

Update

My scrapers perform database operations (read/write/delete). When I said it hangs, I meant that the threads start running but no results appear in the database, and I don't know how long I should wait for them. Here is an example of the status of the 3 threads:

Started GET "/parser" for 127.0.0.1 at 2019-12-18 13:08:35 +0300

Processing by ParserController#call_all_parsers as HTML

[Thread:0x000000031a2158@/home/test/web-programming/parser/backend/app/controllers/parser_controller.rb:19 run, Thread:0x0000000315ec50@/home/test/web-programming/parser/backend/app/controllers/parser_controller.rb:21 run, Thread:0x0000000315c450@/home/test/web-programming/parser/backend/app/controllers/parser_controller.rb:23 run]

Artsom

2 Answers

0

You are limited by the number of database connections in the pool. Check out a separate connection in every thread you open:

Thread.new do 
  ActiveRecord::Base.connection_pool.with_connection do
    Services::Scrapers::FirstScraperService.new.parse(link, data[office][link])
  end
end

Documentation: https://api.rubyonrails.org/classes/ActiveRecord/ConnectionAdapters/ConnectionPool.html

The default pool size in Rails is 5 connections, and the main thread consumes one of them, so only 4 new threads can hold a connection at the same time. If you need more, increase the pool setting in your database.yml.

Also, if you are interested in some syntactic sugar, I would advise you to check out the Parallel gem: https://github.com/grosser/parallel

andoke
  • Unfortunately, this solution doesn't work: execution goes into the `with_connection` method, the program hangs inside it, and I get no result. – Artsom Dec 26 '19 at 15:32
  • So you need to check if everything is threadsafe in your scraper services. – andoke Dec 27 '19 at 23:22
  • So, how can I understand that? Could you please give me advice? – Artsom Dec 30 '19 at 07:43
  • For thread safety, see this other answer: https://stackoverflow.com/a/261690/787436 – andoke Dec 31 '19 at 16:54
  • Thank you, but I ran just one thread (which has access to the shared data) and it doesn't work correctly; I didn't try running 2 or more threads. – Artsom Jan 03 '20 at 12:48
0

I think you need to create a thread pool of a fixed size; otherwise the number of threads has no upper limit, which will not work in practice.
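
A fixed-size pool can be sketched with nothing but the standard library's Queue. This is only an illustration, not the asker's code: run_in_pool, the pool size of 4, and the sample links are all hypothetical, and a real block would call one of the scraper services (and check out its own DB connection) instead of upcasing a string.

```ruby
# Minimal fixed-size worker pool built on Queue (standard library only).
# run_in_pool is a hypothetical helper name, not part of Rails or any gem.
def run_in_pool(jobs, pool_size, &block)
  queue = Queue.new
  jobs.each { |job| queue << job }
  queue.close # after close, pop returns nil once the queue is empty

  results = Queue.new
  workers = Array.new(pool_size) do
    Thread.new do
      # Each worker keeps pulling jobs until the queue is drained.
      while (job = queue.pop)
        results << block.call(job)
      end
    end
  end
  workers.each(&:join)

  # Results arrive in completion order, not in submission order.
  Array.new(results.size) { results.pop }
end

# Usage: at most 4 jobs run at once, however many links there are.
# The pool size of 4 is an assumption (a pool of 5 minus the main thread).
links = %w[link_a link_b link_c link_d link_e link_f]
p run_in_pool(links, 4) { |link| link.upcase }.sort
```

With this shape, adding more scrapers or more links increases the length of the queue, not the number of threads.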

There can be multiple reasons why your threads hang. Once your threads start taking DB connections from the pool, any thread that cannot get one will wait until a connection becomes available. That wait has a timeout, so you should eventually see exceptions related to the timeout.
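
The waiting-then-timeout behaviour described above can be reproduced with a toy pool. To be clear, ToyPool below is a hypothetical stand-in written for this illustration and is not how ActiveRecord implements its pool; it only mimics the observable behaviour, where a thread that cannot get a connection blocks and then raises (in ActiveRecord the error is ActiveRecord::ConnectionTimeoutError).

```ruby
require 'timeout'

# Toy stand-in for a connection pool; NOT ActiveRecord's implementation.
class ToyPool
  class CheckoutTimeout < StandardError; end

  def initialize(size:, checkout_timeout:)
    @available = Queue.new
    size.times { |i| @available << "conn-#{i}" }
    @checkout_timeout = checkout_timeout
  end

  # Lend one connection to the calling thread for the duration of the block.
  def with_connection
    conn = begin
      # Blocks while the pool is empty; gives up after checkout_timeout seconds.
      Timeout.timeout(@checkout_timeout) { @available.pop }
    rescue Timeout::Error
      raise CheckoutTimeout, "no free connection after #{@checkout_timeout}s"
    end
    begin
      yield conn
    ensure
      @available << conn # always hand the connection back
    end
  end
end

# Three threads compete for two connections that are each held for 0.5s,
# so one of the threads waits 0.2s and then fails with CheckoutTimeout.
pool = ToyPool.new(size: 2, checkout_timeout: 0.2)
threads = Array.new(3) do |i|
  Thread.new do
    pool.with_connection { |conn| sleep 0.5; "thread #{i} used #{conn}" }
  rescue ToyPool::CheckoutTimeout => e
    "thread #{i} failed: #{e.message}"
  end
end
threads.each { |t| puts t.value }
```

Which of the three threads fails depends on scheduling. If no such exception ever shows up in your logs, the hang is probably not caused by pool exhaustion alone.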

In your case, I think the issue is that all the threads use the same connection. Moreover, with_connection, as suggested in the other answer, will not help you. From the documentation:

ConnectionPool is completely thread-safe, and will ensure that a connection cannot be used by two threads at the same time, as long as ConnectionPool's contract is correctly followed.

So you need to obtain the connections inside the spawned threads; otherwise, all the calls to with_connection will return the same DB connection.