
So I have been using Selenium to do my scraping, BUT I want to change all the code to Scrapy. The only thing I'm not sure about is that I'm using multiprocessing (the Python library) to speed up my process. I have researched a lot, but I still don't quite get it. I found Multiprocessing of Scrapy Spiders in Parallel Processes, but it doesn't help me: it says this can be done with Twisted, but I haven't found an example yet.

In other forums people say that Scrapy can work with multiprocessing.

One last thing: does Scrapy's CONCURRENT_REQUESTS setting have any connection with multiprocessing?

AngelLB
  • If you need more help, you can just comment here; I'll try to help as much as possible. – eLRuLL Dec 12 '18 at 00:50
  • I've been doing all of my spider's work in one script, that's it. Actually, I don't do a lot of processing with the data; I just get the data and append it to a file using pandas (obviously there is a little bit of processing, like captchas, to get the data). So, when you say "Separate the processes that get the information from the ones that consume that information", what do you mean? .... And another thing: what can we do with Twisted? Is there a way to speed up the process even more? – AngelLB Dec 12 '18 at 15:01

3 Answers


The recommended way of working with Scrapy is NOT to use multiprocessing inside the running spiders.

The better alternative is to invoke several Scrapy jobs, each with its own separate input.
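
A minimal sketch of that idea (the spider name "myspider" and its -a argument are placeholders, not from the original answer): a small driver script launches one scrapy crawl process per input, so every job gets its own process and its own reactor.

    # run_jobs.py: launch several independent Scrapy jobs in parallel,
    # one per input chunk ("myspider" and input_file are placeholders).
    import subprocess

    inputs = ["urls_part1.txt", "urls_part2.txt", "urls_part3.txt"]

    jobs = [
        subprocess.Popen(["scrapy", "crawl", "myspider", "-a", f"input_file={path}"])
        for path in inputs
    ]

    for job in jobs:
        job.wait()  # wait until every crawl has finished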

Scrapy jobs themselves are very fast, in my opinion. Of course, you can always go faster with the special settings you mentioned: CONCURRENT_REQUESTS, CONCURRENT_REQUESTS_PER_DOMAIN, DOWNLOAD_DELAY, etc. But this is basically because Scrapy is asynchronous: it won't wait for requests to complete before scheduling more and continuing with the remaining tasks (scheduling more requests, parsing responses, etc.).

CONCURRENT_REQUESTS has no connection with multiprocessing. It is mostly a way to limit how many requests can be scheduled at once, precisely because everything is asynchronous.
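
For reference, these are ordinary entries in the project's settings.py; the values below are only illustrative and should be tuned for the target site:

    # settings.py (example values only)
    CONCURRENT_REQUESTS = 32             # max requests Scrapy keeps in flight at once
    CONCURRENT_REQUESTS_PER_DOMAIN = 16  # cap per domain
    DOWNLOAD_DELAY = 0.25                # seconds to wait between requests to the same domain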

eLRuLL
  • calling Scrapy within a script is not bad, but it is also not the recommended way. You need to remember that Scrapy is a "web crawling framework", so it can run on its own (its own process, its own invocation, etc.). In the end, you just need to specify the input and then tell Scrapy what to do with the output, and that's how you should configure it to work. Separate the processes that get the information from the ones that consume that information. – eLRuLL Dec 11 '18 at 22:58

You can use Scrapy's built-in concurrency settings (CONCURRENT_REQUESTS, CONCURRENT_REQUESTS_PER_DOMAIN, etc.) to speed up a single spider.

If you need more than that, or you have some heavy processing to do, I suggest that you move that part into a separate process.

Scrapy's responsibility is web parsing; you could, for example, in an item pipeline, send tasks to a queue and have a separate process consume and process them.
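
A minimal sketch of that pattern (the pipeline class and the worker function are illustrative, not from the original answer): the pipeline puts each scraped item on a multiprocessing.Queue, and a separate worker process consumes it, so the heavy processing never blocks the spider.

    # pipelines.py: hand items off to a separate consumer process
    import multiprocessing


    def heavy_worker(queue):
        # Runs in its own process and does the expensive work on each item.
        while True:
            item = queue.get()
            if item is None:  # sentinel: the spider has finished
                break
            # ... process the item here (exports, CPU-bound work, etc.)


    class QueueExportPipeline:
        def open_spider(self, spider):
            self.queue = multiprocessing.Queue()
            self.worker = multiprocessing.Process(target=heavy_worker, args=(self.queue,))
            self.worker.start()

        def process_item(self, item, spider):
            self.queue.put(dict(item))  # plain dicts pickle cleanly across processes
            return item

        def close_spider(self, spider):
            self.queue.put(None)  # tell the worker to stop
            self.worker.join()

The pipeline still has to be enabled in ITEM_PIPELINES like any other Scrapy pipeline.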

Guillaume

Well, generally speaking, Scrapy doesn't support multiprocessing; see

ReactorNotRestartable error in while loop with scrapy

For a particular process, once you call reactor.run() or process.start(), you cannot rerun those commands. The reason is that the reactor cannot be restarted. The reactor stops once the script completes execution.
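
For illustration (a small sketch, not taken from the linked answer; "myspider" is a placeholder spider name), this is exactly the situation that fails:

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    process = CrawlerProcess(get_project_settings())
    process.crawl("myspider")
    process.start()  # runs the reactor until the crawl finishes

    process.crawl("myspider")
    process.start()  # raises twisted.internet.error.ReactorNotRestartable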

But there is a way to work around it:

    from multiprocessing import Pool

    pool = Pool(processes=pool_size, maxtasksperchild=1)

maxtasksperchild is the number of tasks a worker process can complete before it will exit and be replaced with a fresh worker process.

Since maxtasksperchild is set to 1, each subprocess is killed after its task finishes and a new subprocess is created for the next one, so there is no need to restart the reactor within the same process.
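
A minimal sketch of that workaround (the spider name and URLs are placeholders): each pool task runs one crawl, and because maxtasksperchild=1 the worker process exits when its crawl is done, so a fresh process with a fresh reactor handles the next task.

    from multiprocessing import Pool

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings


    def run_spider(start_url):
        # A fresh CrawlerProcess (and therefore a fresh reactor) in each worker.
        process = CrawlerProcess(get_project_settings())
        process.crawl("myspider", start_url=start_url)  # placeholder spider and argument
        process.start()  # blocks until this crawl finishes


    if __name__ == "__main__":
        urls = ["https://example.com/page1", "https://example.com/page2"]
        with Pool(processes=2, maxtasksperchild=1) as pool:
            pool.map(run_spider, urls)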

But this causes memory pressure, so make sure you really need it. I think starting multiple separate processes is the better choice.


I am new to Scrapy, so if you have any better suggestions, please tell me.

Jay