
Is it possible to run Scrapy on PyPy? I've looked through the documentation and the GitHub project, but the only mention of PyPy is that some unit tests were executed on PyPy 2 years ago (see "PyPy support"). There is also a long "Scrapy fails in PyPy" discussion from 3 years ago without a concrete resolution or follow-up.

From what I understand, Scrapy's main dependency, Twisted, is known to work on PyPy. Scrapy also uses lxml for HTML parsing, which has a PyPy-friendly fork. The other dependency, pyOpenSSL, is fully supported (thanks to @Glyph's comment).

alecxe
  • `pyOpenSSL` is fully supported: https://travis-ci.org/pyca/pyopenssl/jobs/66236395 – Glyph Jun 24 '15 at 21:21
  • Just try it out. There is no need to ask a question like this. – hellow Jun 30 '15 at 13:43
  • @cookiesoft first of all, from the research I've made, this is a problem not covered in the documentation or elsewhere. Also, it may potentially have a positive impact on Scrapy's performance. Most importantly, I hope this topic will help others with a similar question, or anyone looking for a way to speed up the web-scraping code they wrote. Please mouse over the downvote button and check again whether it fits the description. – alecxe Jun 30 '15 at 13:51

1 Answer

Yes. :-)

In a bit more detail: I already had PyPy 2.6.0 (with pip) installed on my box. Simply running pip install scrapy almost worked out of the box; it turned out I needed some extra system libraries for lxml. After that it was fine.
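The exact extra libraries depend on your distribution; on Debian/Ubuntu-style systems the lxml build dependencies are typically the libxml2 and libxslt development headers. The package names below are an assumption for that family of distros, so check your own package manager:

```shell
# Assumed Debian/Ubuntu package names for building lxml from source:
sudo apt-get install libxml2-dev libxslt1-dev zlib1g-dev
# Then install Scrapy into the PyPy environment:
pip install scrapy
```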

Once installed, I could run the dmoz tutorial. For example:

[user@localhost scrapy_proj]# scrapy crawl dmoz
2015-06-30 14:34:45 [scrapy] INFO: Scrapy 1.0.0 started (bot: scrapy_proj)
2015-06-30 14:34:45 [scrapy] INFO: Optional features available: ssl, http11
2015-06-30 14:34:45 [scrapy] INFO: Overridden settings: {'BOT_NAME': 'scrapy_proj', 'NEWSPIDER_MODULE': 'scrapy_proj.spiders', 'SPIDER_MODULES': ['scrapy_proj.spiders']}
2015-06-30 14:34:45 [py.warnings] WARNING: :0: UserWarning: You do not have a working installation of the service_identity module: 'No module named service_identity'.  Please install it from <https://pypi.python.org/pypi/service_identity> and make sure all of its dependencies are satisfied.  Without the service_identity module and a recent enough pyOpenSSL to support it, Twisted can perform only rudimentary TLS client hostname verification.  Many valid certificate/hostname mappings may be rejected.

2015-06-30 14:34:45 [scrapy] INFO: Enabled extensions: CoreStats, TelnetConsole, CloseSpider, LogStats, SpiderState
2015-06-30 14:34:45 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-06-30 14:34:45 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-06-30 14:34:45 [scrapy] INFO: Enabled item pipelines: 
2015-06-30 14:34:45 [scrapy] INFO: Spider opened
2015-06-30 14:34:45 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-06-30 14:34:45 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-06-30 14:34:46 [scrapy] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
2015-06-30 14:34:46 [scrapy] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
2015-06-30 14:34:46 [scrapy] INFO: Closing spider (finished)
2015-06-30 14:34:46 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 514,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 16286,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 6, 30, 13, 34, 46, 219002),
 'log_count/DEBUG': 3,
 'log_count/INFO': 7,
 'log_count/WARNING': 1,
 'response_received_count': 2,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2015, 6, 30, 13, 34, 45, 652421)}
2015-06-30 14:34:46 [scrapy] INFO: Spider closed (finished)
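As an aside, the service_identity warning in the log above is unrelated to PyPy; the warning message itself names the fix, which is to install the module into the same environment:

```shell
# Installs the module named in the warning, enabling full TLS
# hostname verification in Twisted:
pip install service_identity
```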

And as requested, here's some more info on the version I'm running:

[user@localhost scrapy_proj]# which scrapy
/opt/pypy/bin/scrapy
[user@localhost scrapy_proj]# scrapy version
2015-06-30 15:04:42 [scrapy] INFO: Scrapy 1.0.0 started (bot: scrapy_proj)
2015-06-30 15:04:42 [scrapy] INFO: Optional features available: ssl, http11
2015-06-30 15:04:42 [scrapy] INFO: Overridden settings: {'BOT_NAME': 'scrapy_proj', 'NEWSPIDER_MODULE': 'scrapy_proj.spiders', 'SPIDER_MODULES': ['scrapy_proj.spiders']}
Scrapy 1.0.0
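A complementary way to double-check which interpreter is actually behind the scrapy script (beyond `which scrapy`) is to ask the interpreter itself; substitute the binary you installed Scrapy into, e.g. `/opt/pypy/bin/pypy`:

```shell
# Prints "PyPy" under PyPy and "CPython" under the standard interpreter,
# followed by the path of the running interpreter binary:
pypy -c 'import platform, sys; print(platform.python_implementation()); print(sys.executable)'
```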
Peter Brittain
  • Thanks for the answer! Could you recheck what `scrapy` points to? Please run `which scrapy` and post the output you've got. – alecxe Jun 30 '15 at 14:00
  • Done - see the bottom of my answer. – Peter Brittain Jun 30 '15 at 14:07
  • Good! I was able to make `Scrapy` work with `PyPy` too. It works for now. I'll try running it on different spiders and in conjunction with `scrapyjs` and `selenium` and, I hope, I'll report the results in this topic. Thank you, will award the bounty in an hour. – alecxe Jun 30 '15 at 14:23