Scrapy is a fast, high-level screen scraping and web crawling framework written in Python, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing. Features of Scrapy include:
- Designed with simplicity in mind: you only write the rules for extracting data from web pages, and Scrapy crawls the entire website for you
- Designed with extensibility in mind: it provides several mechanisms to plug in new code without touching the framework core (see the pipeline sketch after this list)
- Portable, open-source, and 100% Python: runs on Linux, Windows, macOS, and BSD
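As one concrete illustration of that extensibility, here is a minimal item pipeline sketch. The class name RequireTextPipeline and the myproject.pipelines path are hypothetical; process_item and the ITEM_PIPELINES setting are Scrapy's standard plug-in points.

from scrapy.exceptions import DropItem

class RequireTextPipeline:
    # Hypothetical pipeline: discard any scraped item without a 'text' field.
    def process_item(self, item, spider):
        if not item.get('text'):
            raise DropItem("missing 'text' field")
        return item

Such a pipeline is enabled purely through configuration, for example in settings.py:

ITEM_PIPELINES = {'myproject.pipelines.RequireTextPipeline': 300}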
History:
Scrapy was born at London-based web-aggregation and e-commerce company Mydeco, where it was developed and maintained by employees of Mydeco and Insophia (a web-consulting company based in Montevideo, Uruguay). The first public release was in August 2008 under the BSD license, with a milestone 1.0 release happening in June 2015. In 2011, Zyte (formerly Scrapinghub) became the new official maintainer.
Installing Scrapy
We can install Scrapy and its dependencies from PyPI with:
pip install Scrapy
or, to install Scrapy using conda, run:
conda install -c conda-forge scrapy
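Either way, you can verify the installation with Scrapy's built-in version command:

scrapy version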
Example
Here’s the code for a spider that scrapes famous quotes from the website http://quotes.toscrape.com, following the pagination:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        # Extract the text and author of every quote on the page.
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.xpath('span/small/text()').get(),
            }

        # Follow the "Next" pagination link, if there is one.
        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
Put this in a text file, name it something like quotes_spider.py, and run the spider using the runspider command:
scrapy runspider quotes_spider.py -o quotes.json
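When the spider finishes, quotes.json will contain a JSON array with one object per scraped quote. The values below are placeholders showing the shape, not actual output:

[
    {"text": "“…quote text…”", "author": "…author name…"},
    …
]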
Architecture
Scrapy contains multiple components working together in an event-driven architecture. The main components are the Engine, Spider, Scheduler, and Downloader. The data flow between these components is described in detail in the official documentation.
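To make that flow concrete, here is a minimal downloader middleware sketch: it sits between the Engine and the Downloader and is called for every outgoing request. The class name and User-Agent string are made up for illustration; process_request and the DOWNLOADER_MIDDLEWARES setting are Scrapy's standard hooks.

class CustomUserAgentMiddleware:
    # Hypothetical middleware: set a custom User-Agent on each request
    # the Engine hands to the Downloader.
    def process_request(self, request, spider):
        request.headers['User-Agent'] = 'my-crawler/1.0'
        return None  # None tells Scrapy to continue downloading as usual

It would be enabled in settings.py with:

DOWNLOADER_MIDDLEWARES = {'myproject.middlewares.CustomUserAgentMiddleware': 500}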
Online resources:
- Official site: https://scrapy.org
- Official docs: https://docs.scrapy.org
- Git repository: https://github.com/scrapy/scrapy
- FAQ (see also the Recent tab of the scrapy tag)
- Tutorial for beginners
- Curated Scrapy links (libraries, related projects, etc.)