I want to scrape data from three different categories of contracts: goods, services, and construction.

Because each type of contract can be parsed with the same method, my goal is to use a single spider, start the spider on three different URLs, and then extract data in three distinct streams that can be saved to different places.

My understanding is that just listing all three URLs as start_urls will lead to one combined output of data.

My spider inherits from Scrapy's CrawlSpider class.
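
For reference, here is a stripped-down sketch of my current setup (the URLs and rule details below are placeholders, not my real ones):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ContractSpider(CrawlSpider):
    name = 'contracts'
    # placeholder URLs -- one per contract category
    start_urls = [
        'https://example.com/goods',
        'https://example.com/services',
        'https://example.com/construction',
    ]
    rules = (
        # every category is handled by the same callback
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # identical parsing logic for all three categories
        yield {'url': response.url}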

Let me know if you need further information.

p_sutherland

2 Answers


I would suggest that you tackle this problem from another angle. In Scrapy, it is possible to pass arguments to the spider from the command line using the -a option, like so:

scrapy crawl CanCrawler -a contract=goods

You just need to accept the arguments you reference in your class initializer:

import scrapy

class CanCrawler(scrapy.Spider):
    name = 'CanCrawler'

    def __init__(self, contract='', *args, **kwargs):
        super(CanCrawler, self).__init__(*args, **kwargs)
        self.contract = contract  # e.g. 'goods', 'services' or 'construction'
        self.start_urls = ['https://buyandsell.gc.ca/procurement-data/search/site']
        # ...
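
Since each run then handles a single category, you can also send each run's output to its own file with Scrapy's -o option:

scrapy crawl CanCrawler -a contract=goods -o goods.csv
scrapy crawl CanCrawler -a contract=services -o services.csv
scrapy crawl CanCrawler -a contract=construction -o construction.csv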

Something else you might consider is adding multiple arguments, so that you can start on the homepage of a website and use the arguments to get to whatever data you need. For this website, https://buyandsell.gc.ca/procurement-data/search/site, you could for example have two command-line arguments:

scrapy crawl CanCrawler -a procure=ContractHistory -a contract=goods

so you'd get:

import scrapy

class CanCrawler(scrapy.Spider):
    name = 'CanCrawler'

    def __init__(self, procure='', contract='', *args, **kwargs):
        super(CanCrawler, self).__init__(*args, **kwargs)
        self.procure = procure
        self.contract = contract
        self.start_urls = ['https://buyandsell.gc.ca/procurement-data/search/site']
        # ...

and then, depending on which arguments you passed, you could make your crawler follow the corresponding options on the website to get to the data that you want to crawl. Please also see here. I hope this helps!
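
As a rough sketch (the XPath and link text here are assumptions about the site's markup, not verified), those attributes could then drive the navigation inside the same class:

    def parse(self, response):
        # hypothetical navigation: follow links whose text mentions the
        # requested contract type; the real site's markup may differ
        for href in response.xpath('//a[contains(., "%s")]/@href' % self.contract).extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_results)

    def parse_results(self, response):
        # extract the actual contract data here
        pass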

Некто
  • Certainly a different approach... but an EFFECTIVE one. I only need to specify the contract type, so I will not include the `procure`, `*args`, or `**kwargs` arguments. Thanks for your time and insight. – p_sutherland Mar 02 '17 at 04:13

In your spider, yield your items like this:

data = {'categories': {}, 'contracts': {}, 'goods': {}, 'services': {}, 'construction': {}}
yield data

where each entry of the item is a Python dictionary.

Then create a pipeline (the class name below is arbitrary) and, inside its process_item method, do this:

class ContractPipeline(object):
    def process_item(self, item, spider):
        if 'categories' in item:
            categories = item['categories']
            # then process categories, save into a DB maybe

        if 'contracts' in item:
            contracts = item['contracts']
            # then process contracts, save into a DB maybe

        # ... and likewise for 'goods', 'services' and 'construction'
        return item
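
For the pipeline to run, it also has to be enabled in the project's settings.py (the module path here is a placeholder for your project's layout):

ITEM_PIPELINES = {
    'myproject.pipelines.ContractPipeline': 300,
}
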
Umair Ayub
  • I appreciate the reply. I was not clear enough in my description of the problem, so the answer you provided does not solve it. The fault is my own. – p_sutherland Mar 02 '17 at 04:12