How do I get a Scrapy pipeline to fill my MongoDB with my items? Here is what my code looks like at the moment, which reflects the information I got from the Scrapy documentation.
I also want to mention that I have tried returning items instead of yielding them, and I have also tried using item loaders. All of these approaches seem to have the same outcome.
On that note, I want to mention that if I run the command
mongoimport --db mydb --collection mycoll --drop --jsonArray --file ~/path/to/scrapyoutput.json
my database gets populated (as long as I yield and don't return items)... I would really love to get this pipeline working, though.
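To make that workaround concrete, what currently does populate the database is roughly this two-step process (same generic path and names as in the command above):

scrapy crawl congress -o ~/path/to/scrapyoutput.json
mongoimport --db mydb --collection mycoll --drop --jsonArray --file ~/path/to/scrapyoutput.json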
Okay, so here is my code.
Here is my spider:
import scrapy
from scrapy.selector import Selector
from scrapy.loader import ItemLoader
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.http import HtmlResponse
from capstone.items import CapstoneItem

class CongressSpider(CrawlSpider):
    name = "congress"
    allowed_domains = ["www.congress.gov"]
    start_urls = [
        'https://www.congress.gov/members',
    ]
    # creating a rule for my crawler. I only want it to continue to the next page, don't follow any other links.
    rules = (
        Rule(LinkExtractor(allow=(), restrict_xpaths=("//a[@class='next']",)),
             callback="parse_page", follow=True),
    )

    def parse_page(self, response):
        for search in response.selector.xpath(".//li[@class='compact']"):
            yield {
                'member': ' '.join(search.xpath("normalize-space(span/a/text())").extract()).strip(),
                'state': ' '.join(search.xpath("normalize-space(div[@class='quick-search-member']//span[@class='result-item']/span/text())").extract()).strip(),
                'District': ' '.join(search.xpath("normalize-space(div[@class='quick-search-member']//span[@class='result-item'][2]/span/text())").extract()).strip(),
                'party': ' '.join(search.xpath("normalize-space(div[@class='quick-search-member']//span[@class='result-item'][3]/span/text())").extract()).strip(),
                'Served': ' '.join(search.xpath("normalize-space(div[@class='quick-search-member']//span[@class='result-item'][4]/span//li/text())").extract()).strip(),
            }
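For completeness: CapstoneItem is imported but parse_page yields plain dicts. A sketch of a variant that yields CapstoneItem instead (the keys would have to match the fields declared in items.py, e.g. 'served' is lowercase there, unlike the 'Served' key above) would look roughly like this; I'm not sure it changes anything for the pipeline, since process_item converts the item to a dict anyway:

    def parse_page(self, response):
        for search in response.selector.xpath(".//li[@class='compact']"):
            item = CapstoneItem()
            item['member'] = ' '.join(search.xpath("normalize-space(span/a/text())").extract()).strip()
            item['state'] = ' '.join(search.xpath("normalize-space(div[@class='quick-search-member']//span[@class='result-item']/span/text())").extract()).strip()
            # District, party, served assigned the same way as in the dict version above
            yield item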
Here are my settings:
BOT_NAME = 'capstone'
SPIDER_MODULES = ['capstone.spiders']
NEWSPIDER_MODULE = 'capstone.spiders'
ITEM_PIPLINES = {'capstone.pipelines.MongoDBPipeline': 300,}
MONGO_URI = 'mongodb://localhost:27017'
MONGO_DATABASE = 'congress'
ROBOTSTXT_OBEY = True
DOWNLOAD_DELAY = 10
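For reference, I assume the pipeline setting that Scrapy actually picks up can be printed from inside the project directory with something like:

scrapy settings --get ITEM_PIPELINES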
Here is my pipeline.py:

import pymongo
from pymongo import MongoClient
from scrapy.conf import settings
from scrapy.exceptions import DropItem
from scrapy import log

class MongoDBPipeline(object):
    collection_name = 'members'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db[self.collection_name].insert(dict(item))
        return item
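One debugging idea I had (not sure it is the right approach): log from open_spider so the crawl output shows whether the pipeline is ever instantiated and connects. Roughly:

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]
        # if this line never shows up in the crawl log, the pipeline was never enabled
        spider.logger.info("MongoDBPipeline connected to %s / %s", self.mongo_uri, self.mongo_db)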
Here is items.py:

import scrapy

class CapstoneItem(scrapy.Item):
    member = scrapy.Field()
    state = scrapy.Field()
    District = scrapy.Field()
    party = scrapy.Field()
    served = scrapy.Field()
Last but not least, my output looks like this:
2017-02-26 20:44:41 [scrapy.core.engine] INFO: Closing spider (finished)
2017-02-26 20:44:41 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 8007,
'downloader/request_count': 24,
'downloader/request_method_count/GET': 24,
'downloader/response_bytes': 757157,
'downloader/response_count': 24,
'downloader/response_status_count/200': 24,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 2, 27, 4, 44, 41, 767181),
'item_scraped_count': 2139,
'log_count/DEBUG': 2164,
'log_count/INFO': 11,
'request_depth_max': 22,
'response_received_count': 24,
'scheduler/dequeued': 23,
'scheduler/dequeued/memory': 23,
'scheduler/enqueued': 23,
'scheduler/enqueued/memory': 23,
'start_time': datetime.datetime(2017, 2, 27, 4, 39, 58, 834315)}
2017-02-26 20:44:41 [scrapy.core.engine] INFO: Spider closed (finished)
So it seems to me that I am not getting any errors, and my items were scraped. If I had run it with -o myfile.json, I could import that file into my MongoDB, but the pipeline just isn't doing anything!
mongo
MongoDB shell version: 3.2.12
connecting to: test
Server has startup warnings:
2017-02-24T18:51:24.276-0800 I CONTROL  [initandlisten]
2017-02-24T18:51:24.276-0800 I CONTROL  [initandlisten] ** WARNING: /sys/kernel/mm/transparent_hugepage/enabled is 'always'.
2017-02-24T18:51:24.276-0800 I CONTROL [initandlisten] ** We suggest setting it to 'never'
2017-02-24T18:51:24.276-0800 I CONTROL [initandlisten]
2017-02-24T18:51:24.276-0800 I CONTROL [initandlisten] ** WARNING: /sys/kernel/mm/transparent_hugepage/defrag is 'always'.
2017-02-24T18:51:24.276-0800 I CONTROL [initandlisten] ** We suggest setting it to 'never'
2017-02-24T18:51:24.276-0800 I CONTROL [initandlisten]
> show dbs
congress 0.078GB
local 0.078GB
> use congress
switched to db congress
> show collections
members
system.indexes
> db.members.count()
0
>
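To rule out the connection itself, I figure a standalone pymongo check against the same URI and database as in my settings would look something like this (run outside Scrapy):

import pymongo

client = pymongo.MongoClient('mongodb://localhost:27017')
db = client['congress']
db['members'].insert_one({'member': 'connectivity test'})
print(db['members'].count())  # should print at least 1 if inserts reach the server
client.close()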
I suspect my problem has to do with my settings file. I am new to Scrapy and MongoDB, and I have a feeling I haven't told Scrapy where my MongoDB is correctly. Here are some other sources I found and tried to use as examples, but everything I tried led to the same result (the scraping finished, and Mongo stayed empty):
https://realpython.com/blog/python/web-scraping-and-crawling-with-scrapy-and-mongodb/
https://github.com/sebdah/scrapy-mongodb
I have a bunch more sources but not enough reputation to post more, unfortunately. Anyway, any thoughts would be much appreciated. Thanks.