
I'm using scrapy to download pages from many different domains in parallel. I have hundreds of thousands of pages to download, so performance is important.

Unfortunately, when I profile scrapy's speed, I'm only getting a couple of pages per second -- really, about 2 pages per second on average. I've previously written my own multithreaded spiders that do hundreds of pages per second -- I thought for sure scrapy's use of twisted, etc. would be capable of similar magic.

How do I speed scrapy up? I really like the framework, but this performance issue could be a deal-breaker for me.

Here's the relevant part of the settings.py file. Is there some important setting I've missed?

    LOG_ENABLED = False
    CONCURRENT_REQUESTS = 100
    CONCURRENT_REQUESTS_PER_IP = 8
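
For context, a sketch of the other throughput-related settings I'm aware of from the Scrapy docs -- the values shown are just the documented defaults as I understand them, not values I've set:

    # other settings that affect crawl rate, listed purely for context,
    # shown with what I believe are the 0.14 defaults
    DOWNLOAD_DELAY = 0                  # any non-zero value throttles each site
    RANDOMIZE_DOWNLOAD_DELAY = True     # only matters when DOWNLOAD_DELAY > 0
    CONCURRENT_REQUESTS_PER_DOMAIN = 8  # per-domain cap, analogous to the per-IP one above
    DNSCACHE_ENABLED = True             # in-memory DNS cache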

A few parameters:

  • Using scrapy version 0.14
  • The project is deployed on an EC2 large instance, so there should be plenty of memory, CPU, and bandwidth to play with.
  • I'm scheduling crawls using the JSON protocol, keeping the crawler topped up with a few dozen concurrent crawls at any given time. (A rough sketch of those scheduling calls is below this list.)
  • As I said at the beginning, I'm downloading pages from many sites, so remote server performance and CONCURRENT_REQUESTS_PER_IP shouldn't be a worry.
  • For the moment, I'm doing very little post-processing. No xpath; no regex; I'm just saving the url and a few basic statistics for each page. (This will change later once I get the basic performance kinks worked out.)
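
For reference, the scheduling calls look roughly like this -- a sketch against scrapyd's schedule.json endpoint, where the project name, spider name, and start_domain argument are all placeholders:

    # rough sketch of how crawls get queued via scrapyd's JSON API
    # (project/spider names and the start_domain argument are placeholders)
    import urllib
    import urllib2

    def schedule_crawl(project, spider, **spider_args):
        params = dict(project=project, spider=spider, **spider_args)
        data = urllib.urlencode(params)
        # POSTing to schedule.json returns a JSON blob containing the job id
        return urllib2.urlopen("http://localhost:6800/schedule.json", data).read()

    for domain in ["example.com", "example.org"]:
        result = schedule_crawl("myproject", "generic_spider", start_domain=domain)
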
Abe
  • Scrapy does go a lot faster. Is it CPU bound or does it appear idle? Is it slow from the start, or does performance degrade? – Shane Evans Sep 14 '12 at 16:03
  • It's slow from the start. AWS shows the machine running at 100% CPU, but I think the twisted reactor always does that. The machine is still snappy and responsive to SSH commands, new HTTP requests, etc. – Abe Sep 14 '12 at 16:05
  • Working on this for the last hour, I've got a hunch the problem is in the service configuration files for scrapyd. I've started a separate question about restarting the scrapy daemon: http://stackoverflow.com/questions/12428143/how-do-i-restart-the-scrapyd-daemon – Abe Sep 14 '12 at 16:07
  • Hmm, maybe the HTML is too complicated for the scrapy extractor to parse. Try using `lxml` instead. – Kien Truong Sep 14 '12 at 16:11
  • @Dikei I'm pretty sure scrapy already uses lxml. And anyway, like I said in the last bullet point in the question, I'm not actually parsing any HTML. – Abe Sep 14 '12 at 16:17
  • scrapy uses lxml (or libxml2, depending on version), so rewriting to use lxml wouldn't help... but you could be right about complicated parsing, or xpath, or something. Profiling is really the only way to tell. Twisted does not always consume 100% CPU - if scrapy is crawling slowly/politely it's usually nearly idle. – Shane Evans Sep 14 '12 at 16:19
  • Okay, how do you profile within scrapy? – Abe Sep 14 '12 at 16:25
  • Well, scrapy can use `lxml` if it's installed, but it has its own built-in python-only parser. It doesn't really matter though, since you don't use it. But 100% CPU is definitely very odd; I've never reached 100% CPU with scrapy before, usually my bandwidth reaches its limit first. – Kien Truong Sep 14 '12 at 16:25
  • Okay, the 100% CPU was a red herring. I've gone back and checked with mpstat. It's 94% idle, even when I queue 50 spiders at once. It must be some kind of issue with the way Amazon handles its virtualization. Anyway, the main point is that CPU is *not* the problem. – Abe Sep 14 '12 at 16:35
  • scrapyd has a limit of the number of spiders to run concurrently, so if you want to run more you'll need to change that limit. If the individual scrapy crawls are slow and not CPU bound, then it's likely to be configuration. In particular, check there is no download delay set (either in your spider, or in a setting) and your settings above are picked up correctly. – Shane Evans Sep 14 '12 at 18:04
  • I think celery can help you in this situation; also look at [this question](http://stackoverflow.com/questions/11528739/running-scrapy-spiders-in-a-celery-task). – akhter wahab Sep 17 '12 at 09:23
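
Following up on the scrapyd process limit Shane Evans mentions above, here is a sketch of the relevant options in scrapyd's config file -- the option names are from the scrapyd docs, and the values are only examples:

    # scrapyd.conf (example values only)
    [scrapyd]
    # 0 means "derive the limit from max_proc_per_cpu * number of CPUs"
    max_proc = 0
    # default is 4; raise this to let more spiders run at once
    max_proc_per_cpu = 4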

1 Answer


I had this problem in the past, and I solved a large part of it with a 'dirty' old trick.

Run a local caching DNS server.

Most of the time, high CPU usage while hitting many remote sites simultaneously comes from scrapy trying to resolve the URLs.

And please remember to change the DNS settings on the host (/etc/resolv.conf) to point at your LOCAL caching DNS server.
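
For example, a minimal sketch of one way to set this up on a Debian/Ubuntu box with dnsmasq -- the package name, file paths, and upstream resolvers below are assumptions, so adjust for your setup:

    # install a small caching DNS server (assumes Debian/Ubuntu)
    sudo apt-get install dnsmasq

    # tell dnsmasq which upstream resolvers to forward to, in /etc/dnsmasq.conf
    server=8.8.8.8
    server=8.8.4.4

    # then point the host's own resolver at the local cache, in /etc/resolv.conf
    nameserver 127.0.0.1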

The first requests will be slow, but as soon as it starts caching and resolving becomes more efficient, you are going to see HUGE improvements.

I hope this helps with your problem!

Carlos Henrique Cano
  • According to [scrapy doc](http://doc.scrapy.org/en/latest/topics/settings.html#dnscache-enabled), `DNSCACHE_ENABLED` is `True` by default. – AliBZ Jan 08 '14 at 21:40