
I have an example Scrapy project; it is pretty much the default layout. Its folder structure:

craiglist_sample/
├── craiglist_sample
│   ├── __init__.py
│   ├── items.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── test.py
└── scrapy.cfg

When I run scrapy crawl craigs -o items.csv -t csv in the Windows command prompt, it writes the Craigslist items and links to the console.

I want to create an example.py in the main folder and print these items to the Python console from inside it.

I tried

from scrapy import cmdline
cmdline.execute("scrapy crawl craigs".split())

but it produces the same output as the Windows shell. How can I make it print only the items and links?

test.py :

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from craiglist_sample.items import CraiglistSampleItem

class MySpider(CrawlSpider):
    name = "craigs"
##    allowed_domains = ["sfbay.craigslist.org"]
##    start_urls = ["http://sfbay.craigslist.org/npo/"]
    allowed_domains = ["craigslist.org"]
    start_urls = ["http://sfbay.tr.craigslist.org/search/npo?"]

##search\/npo\?s=
    rules = (
        Rule(SgmlLinkExtractor(allow=('s=\d00',), restrict_xpaths=('//a[@class="button next"]',)),
             callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select('//span[@class="pl"]')
##        titles = hxs.select("//p[@class='row']")
        items = []
        for title in titles:
            item = CraiglistSampleItem()
            item["title"] = title.select("a/text()").extract()
            item["link"] = title.select("a/@href").extract()
            items.append(item)
        return items

1 Answer


One approach is to turn off Scrapy's default shell output and insert a print statement inside your parse_items function.

1 - Turn off logging in settings.py

LOG_ENABLED = False

Documentation about logging levels in Scrapy here: http://doc.scrapy.org/en/latest/topics/logging.html
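If you prefer to keep error messages visible instead of silencing everything, Scrapy also has a LOG_LEVEL setting (documented on the same page). A sketch of the relevant settings.py fragment:

```python
# settings.py

# Turn off all Scrapy log output:
LOG_ENABLED = False

# -- or, less drastic: keep only errors --
# LOG_LEVEL = 'ERROR'
```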

2 - Add a print statement for the items you are interested in

for title in titles:
    item = CraiglistSampleItem()
    item["title"] = title.select("a/text()").extract()
    item["link"] = title.select("a/@href").extract()
    items.append(item)
    print item["title"], item["link"]

The shell output will be:

[u'EXECUTIVE ASSISTANT'] [u'/eby/npo/4848086929.html']

[u'Direct Support Professional'] [u'/eby/npo/4848043371.html']

[u'Vocational Counselor'] [u'/eby/npo/4848042572.html']

[u'Day Program Supervisor'] [u'/eby/npo/4848041846.html']

[u'Educational Specialist'] [u'/eby/npo/4848040348.html']

[u'ORGANIZE WITH GREENPEACE - Grassroots Nonprofit Job!'] [u'/eby/npo/4847984654.html']

EDIT: Code for executing from a script

import os
os.system('scrapy crawl craigs > log.txt')

There are several other ways to execute a command-line program from within Python. See "Executing command line programs from within python" and "Calling an external command in Python".
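As a sketch of the subprocess-based approach from those links, the snippet below captures a command's output in Python instead of redirecting it to a file. The run_command helper name is made up for illustration; in the real project you would pass ["scrapy", "crawl", "craigs"], while the stand-in command here just demonstrates the mechanics:

```python
import subprocess
import sys

def run_command(argv):
    """Run a command-line program and return its stdout as a list of text lines."""
    output = subprocess.check_output(argv, universal_newlines=True)
    return output.splitlines()

# In the real project this would be:
#     lines = run_command(["scrapy", "crawl", "craigs"])
# A stand-in command shows the mechanics without needing the Scrapy project:
lines = run_command([sys.executable, "-c", "print('hello')"])
print(lines)
```

Unlike os.system, this gives you the output as Python strings, so you can filter out everything except the item lines before printing or saving them.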

  • Thanks for your answer, but I need to run it from a script. I found this page http://doc.scrapy.org/en/0.16/topics/practices.html#run-scrapy-from-a-script . That testspider seems to work if I create a .py file in that directory. Could you please modify this spider https://github.com/scrapinghub/testspiders/blob/master/testspiders/spiders/followall.py for my spider "MySpider"? – St3114 Jan 21 '15 at 10:32
  • The proposed modification is already integrated with your work. Use the script you wrote down: "from scrapy import cmdline cmdline.execute("scrapy crawl craigs".split())" – aberna Jan 21 '15 at 10:53
  • Sadly no. My intention was to run Scrapy from a .py script and save the items to a file. Anyway, I accept this answer, but I'd be glad if you could help me with that. – St3114 Jan 23 '15 at 11:18
  • @St3114 Can you add more details (the log) about what is not working? What happens if you redirect the output to a file (scrapy crawl craigs > result.txt)? – aberna Jan 23 '15 at 18:41
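For the follow-up question (saving the items from a script rather than parsing console output), one option is a small item pipeline that collects every scraped item into a Python list. This is a hypothetical sketch, not part of the original project: the class name and the priority number are made up, but process_item(item, spider) is the method Scrapy's pipeline interface calls for each item:

```python
# craiglist_sample/pipelines.py (hypothetical collector pipeline)

class ItemCollectorPipeline(object):
    """Collect every scraped item into a class-level list.

    Enable it in settings.py with, for example:
        ITEM_PIPELINES = {'craiglist_sample.pipelines.ItemCollectorPipeline': 100}
    After the crawl finishes, ItemCollectorPipeline.items holds all items,
    ready to be printed or written to a file from the calling script.
    """
    items = []

    def process_item(self, item, spider):
        ItemCollectorPipeline.items.append(item)
        return item
```

This avoids log filtering entirely: the script that launches the crawl can read ItemCollectorPipeline.items afterwards and do whatever it wants with the data.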