Questions tagged [scrapy]

Scrapy is a multi-threaded open-source high-level screen scraping and web crawling framework written in Python used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data extraction to monitoring and automated testing.

Scrapy is a fast high-level screen and web framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing. Features of Scrapy include:

  • Designed with simplicity in mind
  • Only need to write the rules to extract the data from web pages and let Scrapy crawl the entire web site for you
  • Designed with extensibility in mind and so it provides several mechanisms to plug new code without having to touch the framework core
  • Portable, open-source, 100% Python
  • Written in and runs on Linux, Windows, Mac, and BSD.

History:

Scrapy was born at London-based web-aggregation and e-commerce company Mydeco, where it was developed and maintained by employees of Mydeco and Insophia (a web-consulting company based in Montevideo, Uruguay). The first public release was in August 2008 under the BSD license, with a milestone 1.0 release happening in June 2015. In 2011, Zyte (formerly Scrapinghub) became the new official maintainer.


Installing Scrapy

we can install Scrapy and its dependencies from PyPI with:

pip install Scrapy

or to install Scrapy using conda, run:

conda install -c conda-forge scrapy

Example

Here’s the code for a spider that scrapes famous quotes from website http://quotes.toscrape.com, following the pagination:

    import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.xpath('span/small/text()').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

Put this in a text file, name it to something like quotes_spider.py and run the spider using the runspider command:

scrapy runspider quotes_spider.py -o quotes.json

Icon

enter image description here


Architecture

Scrapy contains multiple components working together in an event-driven architecture. Main components are Engine, Spider, Scheduler, and Downloader. The data flow between these components is described by details in the official documentation here.


Online resources:

17743 questions
378
votes
3 answers

Headless Browser and scraping - solutions

I'm trying to put list of possible solutions for browser automatic tests suits and headless browser platforms capable of scraping. BROWSER TESTING / SCRAPING: Selenium - polyglot flagship in browser automation, bindings for Python, Ruby, …
Inoperable
  • 1,429
  • 5
  • 17
  • 33
274
votes
26 answers

Scraping: SSL: CERTIFICATE_VERIFY_FAILED error for http://en.wikipedia.org

I'm practicing the code from 'Web Scraping with Python', and I keep having this certificate problem: from urllib.request import urlopen from bs4 import BeautifulSoup import re pages = set() def getLinks(pageUrl): global pages html =…
Catherine4j
  • 2,772
  • 2
  • 8
  • 10
245
votes
24 answers

Cannot install Lxml on Mac OS X 10.9

I want to install Lxml so I can then install Scrapy. When I updated my Mac today it wouldn't let me reinstall lxml, I get the following error: In file included from…
David O'Regan
  • 2,684
  • 2
  • 13
  • 12
210
votes
18 answers

"OSError: [Errno 1] Operation not permitted" when installing Scrapy in OSX 10.11 (El Capitan) (System Integrity Protection)

I'm trying to install Scrapy Python framework in OSX 10.11 (El Capitan) via pip. The installation script downloads the required modules and at some point returns the following error: OSError: [Errno 1] Operation not permitted:…
Luis U.
  • 2,500
  • 2
  • 17
  • 15
166
votes
10 answers

Can scrapy be used to scrape dynamic content from websites that are using AJAX?

I have recently been learning Python and am dipping my hand into building a web-scraper. It's nothing fancy at all; its only purpose is to get the data off of a betting website and have this data put into Excel. Most of the issues are solvable and…
Joseph
  • 3,899
  • 10
  • 33
  • 52
161
votes
9 answers

Difference between BeautifulSoup and Scrapy crawler?

I want to make a website that shows the comparison between amazon and e-bay product price. Which of these will work better and why? I am somewhat familiar with BeautifulSoup but not so much with Scrapy crawler.
Nishant Bhakta
  • 2,897
  • 3
  • 21
  • 24
125
votes
5 answers

How to pass a user defined argument in scrapy spider

I am trying to pass a user defined argument to a scrapy's spider. Can anyone suggest on how to do that? I read about a parameter -a somewhere but have no idea how to use it.
L Lawliet
  • 2,565
  • 4
  • 26
  • 35
113
votes
11 answers

How to use PyCharm to debug Scrapy projects

I am working on Scrapy 0.20 with Python 2.7. I found PyCharm has a good Python debugger. I want to test my Scrapy spiders using it. Anyone knows how to do that please? What I have tried Actually I tried to run the spider as a script. As a result, I…
William Kinaan
  • 28,059
  • 20
  • 85
  • 118
101
votes
11 answers

How can I use different pipelines for different spiders in a single Scrapy project

I have a scrapy project which contains multiple spiders. Is there any way I can define which pipelines to use for which spider? Not all the pipelines i have defined are applicable for every spider. Thanks
CodeMonkeyB
  • 2,970
  • 4
  • 22
  • 29
101
votes
2 answers

selenium with scrapy for dynamic page

I'm trying to scrape product information from a webpage, using scrapy. My to-be-scraped webpage looks like this: starts with a product_list page with 10 products a click on "next" button loads the next 10 products (url doesn't change between the…
Z. Lin
  • 1,422
  • 3
  • 12
  • 16
86
votes
8 answers

How to run Scrapy from within a Python script

I'm new to Scrapy and I'm looking for a way to run it from a Python script. I found 2 sources that explain this: http://tryolabs.com/Blog/2011/09/27/calling-scrapy-python-script/ http://snipplr.com/view/67006/using-scrapy-from-a-script/ I can't…
user47954
  • 969
  • 1
  • 7
  • 4
85
votes
6 answers

TypeError: Object of type 'bytes' is not JSON serializable

I just started programming Python. I want to use scrapy to create a bot,and it showed TypeError: Object of type 'bytes' is not JSON serializable when I run the project. import json import codecs class W3SchoolPipeline(object): def…
Zhibin
  • 869
  • 1
  • 6
  • 4
75
votes
10 answers

Scrapy Unit Testing

I'd like to implement some unit tests in a Scrapy (screen scraper/web crawler). Since a project is run through the "scrapy crawl" command I can run it through something like nose. Since scrapy is built on top of twisted can I use its unit testing…
ciferkey
  • 2,064
  • 3
  • 20
  • 28
74
votes
3 answers

getting Forbidden by robots.txt: scrapy

while crawling website like https://www.netflix.com, getting Forbidden by robots.txt: https://www.netflix.com/> ERROR: No response downloaded for: https://www.netflix.com/
deepak kumar
  • 743
  • 1
  • 5
  • 4
69
votes
4 answers

Click a Button in Scrapy

I'm using Scrapy to crawl a webpage. Some of the information I need only pops up when you click on a certain button (of course also appears in the HTML code after clicking). I found out that Scrapy can handle forms (like logins) as shown here. But…
naeg
  • 3,944
  • 3
  • 24
  • 29
1
2 3
99 100