I have found lots of Scrapy tutorials (such as this good tutorial) that all require the steps listed below. The result is a full project with lots of files (a scrapy.cfg, several .py files, and a specific folder structure).
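For reference, the layout that `scrapy startproject` generates looks roughly like this (from memory, so the exact files may vary between Scrapy versions):

    myproject/
        scrapy.cfg
        myproject/
            __init__.py
            items.py
            pipelines.py
            settings.py
            spiders/
                __init__.py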
How can I make the steps (listed below) work as a single, self-contained Python file that can be run with `python mycrawler.py`, instead of a full project with lots of files, some .cfg files, etc., where I have to run `scrapy crawl myproject -o myproject.json`? (By the way, it seems that `scrapy` is now a shell command of its own; is that true?)
Note: here might be an answer to this question, but unfortunately it is deprecated and no longer works.
1) Create a new Scrapy project with `scrapy startproject myproject`.
2) Define the data structure with an `Item`, like this:
from scrapy.item import Item, Field

class MyItem(Item):
    title = Field()
    link = Field()
    ...
3) Define the crawler like this:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class MySpider(BaseSpider):
    name = "myproject"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        ...
4) Run with `scrapy crawl myproject -o myproject.json`.
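For what it's worth, here is roughly the kind of single-file script I'm hoping for. It is only a sketch I pieced together from the CrawlerProcess section of the Scrapy docs, so it assumes a fairly recent Scrapy version, and the details (the FEEDS setting, and the body of parse(), which I made up) may well be wrong:

import scrapy
from scrapy.crawler import CrawlerProcess

# Item definition, as in step 2
class MyItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()

# Spider definition, as in step 3 (newer scrapy.Spider instead of BaseSpider)
class MySpider(scrapy.Spider):
    name = "myproject"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com"]

    def parse(self, response):
        # made-up extraction logic: yield one item per link on the page
        for sel in response.xpath("//a"):
            item = MyItem()
            item["title"] = sel.xpath("text()").get()
            item["link"] = sel.xpath("@href").get()
            yield item

if __name__ == "__main__":
    # supposed to replace "scrapy crawl myproject -o myproject.json";
    # the FEEDS setting needs a recent Scrapy (2.1+), if I read the docs right
    process = CrawlerProcess(settings={
        "FEEDS": {"myproject.json": {"format": "json"}},
    })
    process.crawl(MySpider)
    process.start()  # blocks until the crawl is finished

If something along these lines is the right approach, it could then be run with just `python mycrawler.py`.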