How can I combine a simple project with scrapy project?

Question

I have an example Scrapy project. It is pretty much the default. Its folder structure:

craiglist_sample/
├── craiglist_sample
│   ├── __init__.py
│   ├── items.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── test.py
└── scrapy.cfg

When you run scrapy crawl craigs -o items.csv -t csv from the Windows command prompt, it writes the Craigslist items and links to the console.

I want to create an example.py in the main folder and print these items to the Python console from inside it.

I tried

from scrapy import cmdline
cmdline.execute("scrapy crawl craigs".split())

but it writes the same output as the Windows shell. How can I make it print only the items and links?

test.py :

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from craiglist_sample.items import CraiglistSampleItem

class MySpider(CrawlSpider):
    name = "craigs"
##    allowed_domains = ["sfbay.craigslist.org"]
##    start_urls = ["http://sfbay.craigslist.org/npo/"]
    allowed_domains = ["craigslist.org"]
    start_urls = ["http://sfbay.tr.craigslist.org/search/npo?"]

##search\/npo\?s=
    rules = (
        Rule(SgmlLinkExtractor(allow=(r's=\d00',),
                               restrict_xpaths=('//a[@class="button next"]',)),
             callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select('//span[@class="pl"]')
##        titles = hxs.select("//p[@class='row']")
        items = []
        for title in titles:  # distinct loop name so the list is not shadowed
            item = CraiglistSampleItem()
            item["title"] = title.select("a/text()").extract()
            item["link"] = title.select("a/@href").extract()
            items.append(item)
        return items

score 0 · Accepted Answer · edited May 23 '17 at 12:12

One approach is to turn off Scrapy's default log output and add a print statement inside your parse_items function.

1 - Disable logging in the file settings.py

LOG_ENABLED = False

Documentation on logging in Scrapy is here: http://doc.scrapy.org/en/latest/topics/logging.html

2 - Add a print statement for the items you are interested in

# inside parse_items (Python 2 print statement)
for title in titles:
    item = CraiglistSampleItem()
    item["title"] = title.select("a/text()").extract()
    item["link"] = title.select("a/@href").extract()
    items.append(item)
    print item["title"], item["link"]

The shell output will be:

[u'EXECUTIVE ASSISTANT'] [u'/eby/npo/4848086929.html']
[u'Direct Support Professional'] [u'/eby/npo/4848043371.html']
[u'Vocational Counselor'] [u'/eby/npo/4848042572.html']
[u'Day Program Supervisor'] [u'/eby/npo/4848041846.html']
[u'Educational Specialist'] [u'/eby/npo/4848040348.html']
[u'ORGANIZE WITH GREENPEACE - Grassroots Nonprofit Job!'] [u'/eby/npo/4847984654.html']
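
The square brackets and u prefixes in the output above appear because extract() returns a list of unicode strings. A minimal sketch of flattening each field before printing follows. The helpers first_or_empty and format_item are hypothetical names, and the sample dict only stands in for a CraiglistSampleItem, using values taken from the output above.

```python
def first_or_empty(values):
    """Return the first element of a list, or '' when the list is empty."""
    return values[0] if values else ""


def format_item(item):
    """Build one plain-text line from an item whose fields are lists."""
    return "%s -> %s" % (first_or_empty(item["title"]),
                         first_or_empty(item["link"]))


# Stand-in for a scraped item (assumption: fields are lists, as extract() returns)
sample = {"title": ["EXECUTIVE ASSISTANT"], "link": ["/eby/npo/4848086929.html"]}
print(format_item(sample))
```

Inside parse_items, the print line could then call format_item(item) instead of printing the raw lists.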

EDIT: Code for executing from a script

import os
os.system('scrapy crawl craigs > log.txt')
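
As a hedged variation on the os.system call above, the standard library's subprocess module can capture the spider's standard output as a string instead of redirecting it to a file. The sketch is written for Python 3 and rests on several assumptions: the scrapy executable is on PATH, the script is started from the folder that contains scrapy.cfg, and Scrapy sends its log to stderr (its default, as far as I know) while print output goes to stdout. The helper names run_and_capture and print_items_only are made up for illustration.

```python
import subprocess
import sys


def run_and_capture(command):
    """Run a command given as a list of arguments.

    Returns (exit code, captured stdout text). Anything the command
    writes to stderr is discarded.
    """
    completed = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        universal_newlines=True,
    )
    return completed.returncode, completed.stdout


def print_items_only(command):
    """Echo the command's stdout and report a non-zero exit status."""
    code, output = run_and_capture(command)
    if code != 0:
        sys.stderr.write("command exited with status %d\n" % code)
    sys.stdout.write(output)
    return code
```

Calling print_items_only(["scrapy", "crawl", "craigs"]) from example.py would then echo only what parse_items printed, provided the assumptions above hold.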

There are several other ways to execute command-line programs from within Python. Check "Executing command line programs from within python" and "Calling an external command in Python".
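
If the aim is to get the items back into Python rather than the console text, one option is to let Scrapy export them with the -o items.csv -t csv flags from the question and read that file afterwards. This is only a sketch under assumptions: Python 3, a CSV whose header row holds the field names title and link, and a hypothetical helper read_items. The in-memory sample stands in for a real items.csv, whose contents depend on the crawl.

```python
import csv
import io


def read_items(csv_file):
    """Yield one dict per exported item from an open CSV file object."""
    for row in csv.DictReader(csv_file):
        yield row


# Stand-in for items.csv (assumption: columns are named after the item fields)
sample_csv = io.StringIO(
    "link,title\n"
    "/eby/npo/4848086929.html,EXECUTIVE ASSISTANT\n"
)
for item in read_items(sample_csv):
    print(item["title"], item["link"])
```

With a real export, the sample would be replaced by something like open("items.csv", newline="") once the crawl has finished.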

answered Jan 21 '15 at 08:20 by aberna · edited May 23 '17 at 12:12 by Community

Thanks for your answer, but I need to run it from a script. I found this page: http://doc.scrapy.org/en/0.16/topics/practices.html#run-scrapy-from-a-script and the test spider there seems to work if I create a .py file in that directory. Could you please modify this spider https://github.com/scrapinghub/testspiders/blob/master/testspiders/spiders/followall.py for my spider "MySpider"? – St3114 Jan 21 '15 at 10:32
The proposed modification is already integrated with your work. Use the script you wrote: "from scrapy import cmdline; cmdline.execute("scrapy crawl craigs".split())" – aberna Jan 21 '15 at 10:53
Sadly, no. My intention was to run Scrapy from a .py script and save the items to a file. Anyway, I accept this answer, but I'll be glad if you can help me with that. – St3114 Jan 23 '15 at 11:18
@St3114 can you add more details (the log) about what is not working? What happens if you redirect the output to a file (scrapy crawl craigs > result.txt)? – aberna Jan 23 '15 at 18:41
