
I can run a crawl in a Python script with the following recipe from the wiki:

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from testspiders.spiders.followall import FollowAllSpider
from scrapy.utils.project import get_project_settings

spider = FollowAllSpider(domain='scrapinghub.com')
settings = get_project_settings()
crawler = Crawler(settings)
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run()

As you can see, I can just pass the domain to FollowAllSpider, but my question is: how can I pass the start_urls (actually a date that will be added to a fixed URL) to my spider class using the above code?

This is my spider class:

class MySpider(CrawlSpider):
    name = 'tw'
    def __init__(self, date):
        y, m, d = date.split('-')  # this is a test; it could be split with a regex
        try:
            y, m, d = int(y), int(m), int(d)
        except ValueError:
            raise ValueError('Enter a valid date')

        self.allowed_domains = ['mydomin.com']
        self.start_urls = ['my_start_urls{}-{}-{}'.format(y, m, d)]

    def parse(self, response):
        questions = Selector(response).xpath('//div[@class="result-link"]/span/a/@href') 
        for question in questions:
            item = PoptopItem()
            item['url'] = question.extract()
            yield item['url']

And this is my script:

from pdfcreator import convertor
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
#from testspiders.spiders.followall import FollowAllSpider
from scrapy.utils.project import get_project_settings
from poptop.spiders.stackoverflow_spider import MySpider
from poptop.items import PoptopItem

settings = get_project_settings()
crawler = Crawler(settings) 
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()

date = raw_input('Enter the date with this format (d-m-Y) : ')
print date
spider = MySpider(date=date)
crawler.crawl(spider)
crawler.start()
log.start()
item = PoptopItem()

for url in item['url']:
    convertor(url)

reactor.run() # the script will block here until the spider_closed signal was sent

If I just print the item, I'll get the following error:

2015-02-25 17:13:47+0330 [tw] ERROR: Spider must return Request, BaseItem or None, got 'unicode' in <GET test-link2015-1-17>

My items.py:

import scrapy


class PoptopItem(scrapy.Item):
    titles = scrapy.Field()
    content = scrapy.Field()
    url = scrapy.Field()
Mazdak

1 Answer


You need to modify your __init__() constructor to accept the date argument. Also, I would use datetime.strptime() to parse the date string:

from datetime import datetime

class MySpider(CrawlSpider):
    name = 'tw'
    allowed_domains = ['test.com']

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs) 

        date = kwargs.get('date')
        if not date:
            raise ValueError('No date given')

        dt = datetime.strptime(date, "%m-%d-%Y")
        self.start_urls = ['http://test.com/{dt.year}-{dt.month}-{dt.day}'.format(dt=dt)]

Then, you would instantiate the spider this way:

spider = MySpider(date='01-01-2015')
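
For reference, here is a quick standalone sketch (plain stdlib, reusing the hypothetical `http://test.com/...` URL from the answer) of what that `strptime()` call produces and the `start_urls` entry it leads to:

from datetime import datetime

# '%m-%d-%Y' matches strings like '01-01-2015';
# a malformed string raises ValueError automatically
dt = datetime.strptime('01-01-2015', '%m-%d-%Y')
print dt.year, dt.month, dt.day  # 2015 1 1

print 'http://test.com/{dt.year}-{dt.month}-{dt.day}'.format(dt=dt)
# http://test.com/2015-1-1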

Or, you can even avoid parsing the date at all, passing a datetime instance in the first place:

class MySpider(CrawlSpider):
    name = 'tw'
    allowed_domains = ['test.com']

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs) 

        dt = kwargs.get('dt')
        if not dt:
            raise ValueError('No date given')

        self.start_urls = ['http://test.com/{dt.year}-{dt.month}-{dt.day}'.format(dt=dt)]

spider = MySpider(dt=datetime(year=2014, month=1, day=1))

And, just FYI, see this answer for a detailed example of how to run Scrapy from a script.

alecxe
  • Thanks a lot for the explanation! As I said, the date parser is just a test! And thanks for the link suggestion. Now, as you can see, my `parse` function yields the `url`; how can I get that (after running the crawl)? – Mazdak Feb 25 '15 at 12:53
  • I used items, but it raised a `KeyError`; it seems that it doesn't run the crawl!! `for url in item['url']:` – Mazdak Feb 25 '15 at 13:01
  • @KasraAD I think you just need to `yield item` instead of `yield item['url']`. Let me know if it helped or not. – alecxe Feb 25 '15 at 13:15
  • In my spider I just `yield item`; again that error! I will edit the question and add my script; hope that could help! – Mazdak Feb 25 '15 at 13:26
  • @KasraAD two things: 1. Why are you instantiating an item inside the script where you run the crawling? (I think you don't need it here.) If you want to post-process an item returned from the spider, do it in a pipeline. 2. Can you also show the `PoptopItem` class definition? Thanks. – alecxe Feb 25 '15 at 14:24
  • Yep, I added it. Actually I'm new to Scrapy and I haven't worked with pipelines, but all I need here is to be able to get the `question.extract()` inside my script! – Mazdak Feb 25 '15 at 14:37
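
Putting the comment thread together, here is a minimal sketch of the fix (assuming the same pre-1.0 Scrapy API as the snippets above): have `parse()` yield the item itself, and collect the scraped items in the script through the `item_scraped` signal instead of instantiating `PoptopItem` directly, since the items only exist once the crawl has actually run:

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from scrapy.utils.project import get_project_settings
from poptop.spiders.stackoverflow_spider import MySpider
from pdfcreator import convertor

urls = []

def collect_url(item, response, spider):
    # item_scraped fires once per item the spider yields
    urls.append(item['url'])

settings = get_project_settings()
crawler = Crawler(settings)
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.signals.connect(collect_url, signal=signals.item_scraped)
crawler.configure()

crawler.crawl(MySpider(date=raw_input('Enter the date (m-d-Y): ')))
crawler.start()
log.start()
reactor.run()  # blocks until spider_closed is sent

# the urls list is populated only after the crawl has finished
for url in urls:
    convertor(url)

And in the spider, the last line of `parse()` must be `yield item`, not `yield item['url']`, so that Scrapy receives an item rather than a bare unicode string, which is exactly what the "got 'unicode'" error above complains about.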