I want to build a crawler that takes the URL of a webpage to be scraped and returns the scraped result back to a web page. Right now I start Scrapy from the terminal and store the response in a file. How can I start the crawler when some input is posted to Flask, process it, and return the response back?

  • Sorry, that last line is a little fuzzy. What are you doing with Flask? What process? And return the response back to where? – nivix zixer Jul 24 '15 at 04:01
  • I'm using Flask to expose the endpoints, so that from a web app someone can post an input, i.e. the link of the webpage to be scraped. Then I want to start the spider, pass that input to it, and return the crawler's response back to the web app. – Ashish Jul 24 '15 at 04:06
  • I just answered a similar question here: https://stackoverflow.com/questions/36384286/how-to-integrate-flask-scrapy – Pawel Miech May 17 '16 at 08:14

1 Answer

You need to create a CrawlerProcess inside your Flask application and run the crawl programmatically. See the docs.

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(MySpider)
process.start()  # the script will block here until the crawl is finished
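
One caveat: process.start() runs the Twisted reactor, which cannot be restarted once it has stopped, so calling CrawlerProcess directly inside a Flask view will raise ReactorNotRestartable on the second request. A simple workaround is to run each crawl in a child process and read the exported items back. Here is a rough sketch, assuming a standalone my_spider.py whose spider accepts a url argument (both names are placeholders, not something Scrapy provides):

import json
import os
import subprocess
import tempfile
import uuid

from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route('/crawl', methods=['POST'])
def crawl():
    url = request.get_json()['url']
    # Unique output file for this request's items.
    out = os.path.join(tempfile.gettempdir(), uuid.uuid4().hex + '.json')
    # 'scrapy runspider' runs the crawl in a child process, so the
    # Twisted reactor never starts inside the Flask process itself.
    subprocess.run(
        ['scrapy', 'runspider', 'my_spider.py', '-a', 'url=' + url, '-o', out],
        check=True,
    )
    with open(out) as f:
        items = json.load(f)
    os.remove(out)
    return jsonify(items=items)

This blocks the HTTP request until the crawl finishes, which is fine for small crawls but ties up a worker for long ones.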

Before moving on with your project, I advise you to look into a Python task queue (like rq). That will let you run Scrapy crawls in the background, so your Flask application won't freeze while a scrape is running; a sketch of that setup follows below.
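
A minimal sketch of that approach, assuming a running Redis server, an `rq worker` process, and a tasks.run_crawl function (e.g. the subprocess call from the sketch above moved into tasks.py; all of these names are illustrative):

from flask import Flask, jsonify, request
from redis import Redis
from rq import Queue
from rq.job import Job

app = Flask(__name__)
redis = Redis()
q = Queue(connection=redis)

@app.route('/crawl', methods=['POST'])
def crawl():
    url = request.get_json()['url']
    # Enqueue the crawl instead of blocking the HTTP request.
    job = q.enqueue('tasks.run_crawl', url)
    return jsonify(job_id=job.get_id()), 202

@app.route('/result/<job_id>')
def result(job_id):
    # Poll this endpoint until the crawl is finished.
    job = Job.fetch(job_id, connection=redis)
    if not job.is_finished:
        return jsonify(status=job.get_status()), 202
    return jsonify(items=job.result)

The client gets a job id back immediately and polls /result/<job_id> for the items, so no Flask worker is ever tied up by a crawl.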

  • I have used it under Scrapy. Could you please provide a code snippet that runs the spider under a Flask application? – Vasim Aug 12 '15 at 09:33