Scrapy is a fast, high-level screen scraping and web crawling framework written in Python, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing. Features of Scrapy include:
- Designed with simplicity in mind: you only write the rules for extracting data from web pages, and Scrapy crawls the entire website for you
- Designed with extensibility in mind: it provides several mechanisms to plug in new code without touching the framework core (see the pipeline sketch after this list)
- Portable, open-source, and 100% Python: runs on Linux, Windows, macOS, and BSD
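As one concrete illustration of that extensibility, here is a minimal item pipeline sketch. The class name RequireTextPipeline and the myproject.pipelines path are hypothetical; process_item and the ITEM_PIPELINES setting are Scrapy's standard plug-in points.

from scrapy.exceptions import DropItem

class RequireTextPipeline:
    # Hypothetical pipeline: discard any scraped item without a 'text' field.
    def process_item(self, item, spider):
        if not item.get('text'):
            raise DropItem("missing 'text' field")
        return item

Such a pipeline is enabled purely through configuration, for example in settings.py:

ITEM_PIPELINES = {'myproject.pipelines.RequireTextPipeline': 300}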
History:
Scrapy was born at London-based web-aggregation and e-commerce company Mydeco, where it was developed and maintained by employees of Mydeco and Insophia (a web-consulting company based in Montevideo, Uruguay). The first public release was in August 2008 under the BSD license, with a milestone 1.0 release happening in June 2015. In 2011, Zyte (formerly Scrapinghub) became the new official maintainer.
Installing Scrapy
We can install Scrapy and its dependencies from PyPI with:
pip install Scrapy
or, to install Scrapy using conda, run:
conda install -c conda-forge scrapy
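Either way, you can verify the installation with Scrapy's built-in version command:

scrapy version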
Example
Here’s the code for a spider that scrapes famous quotes from the website http://quotes.toscrape.com, following the pagination:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        # Extract the text and author of every quote on the page.
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.xpath('span/small/text()').get(),
            }

        # Follow the "Next" pagination link, if there is one.
        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
Put this in a text file, name it something like quotes_spider.py, and run the spider using the runspider command:
scrapy runspider quotes_spider.py -o quotes.json
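When the spider finishes, quotes.json will contain a JSON array with one object per scraped quote. The values below are placeholders showing the shape, not actual output:

[
    {"text": "“…quote text…”", "author": "…author name…"},
    …
]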
Architecture
Scrapy contains multiple components working together in an event-driven architecture. The main components are the Engine, Spider, Scheduler, and Downloader. The data flow between these components is described in detail in the official documentation.
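To make that flow concrete, here is a minimal downloader middleware sketch: it sits between the Engine and the Downloader and is called for every outgoing request. The class name and User-Agent string are made up for illustration; process_request and the DOWNLOADER_MIDDLEWARES setting are Scrapy's standard hooks.

class CustomUserAgentMiddleware:
    # Hypothetical middleware: set a custom User-Agent on each request
    # the Engine hands to the Downloader.
    def process_request(self, request, spider):
        request.headers['User-Agent'] = 'my-crawler/1.0'
        return None  # None tells Scrapy to continue downloading as usual

It would be enabled in settings.py with:

DOWNLOADER_MIDDLEWARES = {'myproject.middlewares.CustomUserAgentMiddleware': 500}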
Online resources:
- Official site: https://scrapy.org
- Official docs: https://docs.scrapy.org
- Git repository: https://github.com/scrapy/scrapy
- FAQ (see also the Recent tab of the scrapy tag)
- Tutorial for beginners
- Curated Scrapy links (libraries, related projects, etc.)