How do I get a Scrapy pipeline to fill my MongoDB with my items? Here is what my code looks like at the moment, which reflects the information I got from the Scrapy documentation.
I also want to mention that I have tried returning items instead of yielding them, and I have also tried using item loaders. All of these approaches seem to have the same outcome.
On that note, I want to mention that if I run the command
mongoimport --db mydb --collection mycoll --drop --jsonArray --file ~/path/to/scrapyoutput.json
my database gets populated (as long as I yield and don't return items)... I would really love to get this pipeline working, though.
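To make that workaround concrete, what currently does populate the database is roughly this two-step process (same generic path and names as in the command above):

scrapy crawl congress -o ~/path/to/scrapyoutput.json
mongoimport --db mydb --collection mycoll --drop --jsonArray --file ~/path/to/scrapyoutput.json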
Okay, so here is my code.
Here is my spider:
import scrapy
from scrapy.selector import Selector
from scrapy.loader import ItemLoader
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.http import HtmlResponse
from capstone.items import CapstoneItem

class CongressSpider(CrawlSpider):
    name = "congress"
    allowed_domains = ["www.congress.gov"]
    start_urls = [
        'https://www.congress.gov/members',
    ]
    # creating a rule for my crawler. I only want it to continue to the next page, don't follow any other links.
    rules = (
        Rule(LinkExtractor(allow=(), restrict_xpaths=("//a[@class='next']",)),
             callback="parse_page", follow=True),
    )

    def parse_page(self, response):
        for search in response.selector.xpath(".//li[@class='compact']"):
            yield {
                'member': ' '.join(search.xpath("normalize-space(span/a/text())").extract()).strip(),
                'state': ' '.join(search.xpath("normalize-space(div[@class='quick-search-member']//span[@class='result-item']/span/text())").extract()).strip(),
                'District': ' '.join(search.xpath("normalize-space(div[@class='quick-search-member']//span[@class='result-item'][2]/span/text())").extract()).strip(),
                'party': ' '.join(search.xpath("normalize-space(div[@class='quick-search-member']//span[@class='result-item'][3]/span/text())").extract()).strip(),
                'Served': ' '.join(search.xpath("normalize-space(div[@class='quick-search-member']//span[@class='result-item'][4]/span//li/text())").extract()).strip(),
            }
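For completeness: CapstoneItem is imported but parse_page yields plain dicts. A sketch of a variant that yields CapstoneItem instead (the keys would have to match the fields declared in items.py, e.g. 'served' is lowercase there, unlike the 'Served' key above) would look roughly like this; I'm not sure it changes anything for the pipeline, since process_item converts the item to a dict anyway:

    def parse_page(self, response):
        for search in response.selector.xpath(".//li[@class='compact']"):
            item = CapstoneItem()
            item['member'] = ' '.join(search.xpath("normalize-space(span/a/text())").extract()).strip()
            item['state'] = ' '.join(search.xpath("normalize-space(div[@class='quick-search-member']//span[@class='result-item']/span/text())").extract()).strip()
            # District, party, served assigned the same way as in the dict version above
            yield item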
Here are my settings:
BOT_NAME = 'capstone'
SPIDER_MODULES = ['capstone.spiders']
NEWSPIDER_MODULE = 'capstone.spiders'
ITEM_PIPLINES = {'capstone.pipelines.MongoDBPipeline': 300,}
MONGO_URI = 'mongodb://localhost:27017'
MONGO_DATABASE = 'congress'
ROBOTSTXT_OBEY = True
DOWNLOAD_DELAY = 10
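For reference, I assume the pipeline setting that Scrapy actually picks up can be printed from inside the project directory with something like:

scrapy settings --get ITEM_PIPELINES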
Here is my pipeline.py:

import pymongo
from pymongo import MongoClient
from scrapy.conf import settings
from scrapy.exceptions import DropItem
from scrapy import log

class MongoDBPipeline(object):
    collection_name = 'members'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db[self.collection_name].insert(dict(item))
        return item
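One debugging idea I had (not sure it is the right approach): log from open_spider so the crawl output shows whether the pipeline is ever instantiated and connects. Roughly:

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]
        # if this line never shows up in the crawl log, the pipeline was never enabled
        spider.logger.info("MongoDBPipeline connected to %s / %s", self.mongo_uri, self.mongo_db)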
Here is items.py:

import scrapy

class CapstoneItem(scrapy.Item):
    member = scrapy.Field()
    state = scrapy.Field()
    District = scrapy.Field()
    party = scrapy.Field()
    served = scrapy.Field()
Last but not least, my output looks like this:
2017-02-26 20:44:41 [scrapy.core.engine] INFO: Closing spider (finished)
2017-02-26 20:44:41 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 8007,
'downloader/request_count': 24,
'downloader/request_method_count/GET': 24,
'downloader/response_bytes': 757157,
'downloader/response_count': 24,
'downloader/response_status_count/200': 24,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 2, 27, 4, 44, 41, 767181),
'item_scraped_count': 2139,
'log_count/DEBUG': 2164,
'log_count/INFO': 11,
'request_depth_max': 22,
'response_received_count': 24,
'scheduler/dequeued': 23,
'scheduler/dequeued/memory': 23,
'scheduler/enqueued': 23,
'scheduler/enqueued/memory': 23,
'start_time': datetime.datetime(2017, 2, 27, 4, 39, 58, 834315)}
2017-02-26 20:44:41 [scrapy.core.engine] INFO: Spider closed (finished)
So it seems to me that I am not getting any errors, and my items were scraped. If I had run it with -o myfile.json, I could import that file into my MongoDB, but the pipeline just isn't doing anything!
mongo
MongoDB shell version: 3.2.12
connecting to: test
Server has startup warnings:
2017-02-24T18:51:24.276-0800 I CONTROL  [initandlisten]
2017-02-24T18:51:24.276-0800 I CONTROL  [initandlisten] ** WARNING: /sys/kernel/mm/transparent_hugepage/enabled is 'always'.
2017-02-24T18:51:24.276-0800 I CONTROL [initandlisten] ** We suggest setting it to 'never'
2017-02-24T18:51:24.276-0800 I CONTROL [initandlisten]
2017-02-24T18:51:24.276-0800 I CONTROL [initandlisten] ** WARNING: /sys/kernel/mm/transparent_hugepage/defrag is 'always'.
2017-02-24T18:51:24.276-0800 I CONTROL [initandlisten] ** We suggest setting it to 'never'
2017-02-24T18:51:24.276-0800 I CONTROL [initandlisten]
> show dbs
congress 0.078GB
local 0.078GB
> use congress
switched to db congress
> show collections
members
system.indexes
> db.members.count()
0
>
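To rule out the connection itself, I figure a standalone pymongo check against the same URI and database as in my settings would look something like this (run outside Scrapy):

import pymongo

client = pymongo.MongoClient('mongodb://localhost:27017')
db = client['congress']
db['members'].insert_one({'member': 'connectivity test'})
print(db['members'].count())  # should print at least 1 if inserts reach the server
client.close()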
I suspect my problem has to do with my settings file. I am new to Scrapy and MongoDB, and I have a feeling I haven't told Scrapy where my MongoDB is correctly. Here are some other sources I found and tried to use as examples, but everything I tried led to the same result (the scraping finished, and Mongo stayed empty):
https://realpython.com/blog/python/web-scraping-and-crawling-with-scrapy-and-mongodb/
https://github.com/sebdah/scrapy-mongodb
I have a bunch more sources but not enough reputation to post more, unfortunately. Anyway, any thoughts would be much appreciated. Thanks.