I have found lots of Scrapy tutorials (such as this good tutorial) that all require the steps listed below. The result is a full project with lots of files (a scrapy.cfg, several .py files, and a specific folder structure).
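For reference, the layout that `scrapy startproject` generates looks roughly like this (from memory, so the exact files may vary between Scrapy versions):

    myproject/
        scrapy.cfg
        myproject/
            __init__.py
            items.py
            pipelines.py
            settings.py
            spiders/
                __init__.py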
How can I make the steps (listed below) work as a single, self-contained Python file that can be run with `python mycrawler.py`, instead of a full project with lots of files, some .cfg files, etc., where I have to run `scrapy crawl myproject -o myproject.json`? (By the way, it seems that `scrapy` is now a shell command of its own; is that true?)
Note: here might be an answer to this question, but unfortunately it is deprecated and no longer works.
1) Create a new Scrapy project with `scrapy startproject myproject`.
2) Define the data structure with an `Item`, like this:
from scrapy.item import Item, Field

class MyItem(Item):
    title = Field()
    link = Field()
    ...
3) Define the crawler like this:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class MySpider(BaseSpider):
    name = "myproject"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        ...
4) Run with `scrapy crawl myproject -o myproject.json`.
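For what it's worth, here is roughly the kind of single-file script I'm hoping for. It is only a sketch I pieced together from the CrawlerProcess section of the Scrapy docs, so it assumes a fairly recent Scrapy version, and the details (the FEEDS setting, and the body of parse(), which I made up) may well be wrong:

import scrapy
from scrapy.crawler import CrawlerProcess

# Item definition, as in step 2
class MyItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()

# Spider definition, as in step 3 (newer scrapy.Spider instead of BaseSpider)
class MySpider(scrapy.Spider):
    name = "myproject"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com"]

    def parse(self, response):
        # made-up extraction logic: yield one item per link on the page
        for sel in response.xpath("//a"):
            item = MyItem()
            item["title"] = sel.xpath("text()").get()
            item["link"] = sel.xpath("@href").get()
            yield item

if __name__ == "__main__":
    # supposed to replace "scrapy crawl myproject -o myproject.json";
    # the FEEDS setting needs a recent Scrapy (2.1+), if I read the docs right
    process = CrawlerProcess(settings={
        "FEEDS": {"myproject.json": {"format": "json"}},
    })
    process.crawl(MySpider)
    process.start()  # blocks until the crawl is finished

If something along these lines is the right approach, it could then be run with just `python mycrawler.py`.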