I have been working with Python and Scrapy for the past week, using the following tutorial: https://realpython.com/web-scraping-with-scrapy-and-mongodb/
The tutorial walks through scraping the top questions and their URLs from Stack Overflow with the Scrapy web crawler, then storing them in a MongoDB database and collection.
I'm trying to adapt what the tutorial does to scrape and store multiple items into multiple collections in the same MongoDB database, and then export everything in CSV format. I've figured out most of it: the pipeline to MongoDB, storing multiple collections, and changing the collection name based on the name of the item being scraped. What I cannot get working are the spiders, or more specifically the xpaths Scrapy uses to find the specified items on the page; to my understanding, the problem lies with the xpaths being wrong.
I have no prior experience with Scrapy, and I've spent days of research trying to figure out the xpaths, but I can't seem to get them working.
The page I'm trying to scrape: https://stackoverflow.com/
The spider for question titles and URLs, which works as intended:
from scrapy import Spider
from scrapy.selector import Selector

from stack.items import QuestionItem


class QuestionSpider(Spider):
    name = "questions"
    allowed_domains = ["stackoverflow.com"]
    start_urls = [
        "http://stackoverflow.com/questions?pagesize=50&sort=newest",
    ]

    def parse(self, response):
        questions = Selector(response).xpath('//div[@class="summary"]/h3')

        for question in questions:
            item = QuestionItem()
            item['title'] = question.xpath(
                'a[@class="question-hyperlink"]/text()').extract()[0]
            item['url'] = question.xpath(
                'a[@class="question-hyperlink"]/@href').extract()[0]
            yield item
The spider for the number of answers, votes and views, which is not working as intended:
from scrapy import Spider
from scrapy.selector import Selector

from stack.items import PopularityItem


class PopularitySpider(Spider):
    name = "popularity"
    allowed_domains = ["stackoverflow.com"]
    start_urls = [
        "http://stackoverflow.com/questions?pagesize=50&sort=newest",
    ]

    def parse(self, response):
        popularity = Selector(response).xpath('//div[@class="summary"]/h3')

        for poppart in popularity:
            item = PopularityItem()
            item['votes'] = poppart.xpath(
                'div[contains(@class, "votes")]/text()').extract()
            item['answers'] = poppart.xpath(
                'div[contains(@class, "answers")]/text()').extract()
            item['views'] = poppart.xpath(
                'div[contains(@class, "views")]/text()').extract()
            yield item
And lastly there is the third spider, which has similar problems to the second.
With the second spider, I get the following output and data stored in my MongoDB database after starting it with:
scrapy crawl popularity
{ "_id" : ObjectId("5bbde11cb395bb1dc0d9410d"), "votes" : [ ], "answers" : [ ], "views" : [ ] }
{ "_id" : ObjectId("5bbde11cb395bb1dc0d9410e"), "votes" : [ ], "answers" : [ ], "views" : [ ] }
{ "_id" : ObjectId("5bbde11cb395bb1dc0d9410f"), "votes" : [ ], "answers" : [ ], "views" : [ ] }
{ "_id" : ObjectId("5bbde11cb395bb1dc0d94110"), "votes" : [ ], "answers" : [ ], "views" : [ ] }
{ "_id" : ObjectId("5bbde11cb395bb1dc0d94111"), "votes" : [ ], "answers" : [ ], "views" : [ ] }
{ "_id" : ObjectId("5bbde11cb395bb1dc0d94112"), "votes" : [ ], "answers" : [ ], "views" : [ ] }
{ "_id" : ObjectId("5bbde11cb395bb1dc0d94113"), "votes" : [ ], "answers" : [ ], "views" : [ ] }
{ "_id" : ObjectId("5bbde11cb395bb1dc0d94114"), "votes" : [ ], "answers" : [ ], "views" : [ ] }
{ "_id" : ObjectId("5bbde11cb395bb1dc0d94115"), "votes" : [ ], "answers" : [ ], "views" : [ ] }
{ "_id" : ObjectId("5bbde11cb395bb1dc0d94116"), "votes" : [ ], "answers" : [ ], "views" : [ ] }
{ "_id" : ObjectId("5bbde11cb395bb1dc0d94117"), "votes" : [ ], "answers" : [ ], "views" : [ ] }
{ "_id" : ObjectId("5bbde11cb395bb1dc0d94118"), "votes" : [ ], "answers" : [ ], "views" : [ ] }
{ "_id" : ObjectId("5bbde11cb395bb1dc0d94119"), "votes" : [ ], "answers" : [ ], "views" : [ ] }
{ "_id" : ObjectId("5bbde11cb395bb1dc0d9411a"), "votes" : [ ], "answers" : [ ], "views" : [ ] }
{ "_id" : ObjectId("5bbde11cb395bb1dc0d9411b"), "votes" : [ ], "answers" : [ ], "views" : [ ] }
{ "_id" : ObjectId("5bbde11cb395bb1dc0d9411c"), "votes" : [ ], "answers" : [ ], "views" : [ ] }
{ "_id" : ObjectId("5bbde11cb395bb1dc0d9411d"), "votes" : [ ], "answers" : [ ], "views" : [ ] }
{ "_id" : ObjectId("5bbde11cb395bb1dc0d9411e"), "votes" : [ ], "answers" : [ ], "views" : [ ] }
{ "_id" : ObjectId("5bbde11cb395bb1dc0d9411f"), "votes" : [ ], "answers" : [ ], "views" : [ ] }
{ "_id" : ObjectId("5bbde11cb395bb1dc0d94120"), "votes" : [ ], "answers" : [ ], "views" : [ ] }
As you can see, all the items are empty. The only way I have been able to get some output was with the xpath:
//div[contains(@class, "views")]/text()
To my understanding, starting the path with "//" selects every matching div with class "views" anywhere in the document, not just those under the current node.
Using this method only works partially: I only get output for the views item, and all of it is stored in one item row; then, on the next iteration of the loop, all the output is stored again in the next item row. This makes sense, because I'm using
//div instead of div
This happens (or I think it does) because of the loop, which iterates over the number of "summary" classes on the page as a way of telling the scraper how many rows to scrape and store. That is done with the following xpath and code segment (already shown above, but repeated for clarity):
def parse(self, response):
    popularity = Selector(response).xpath('//div[@class="summary"]/h3')

    for poppart in popularity:
The output I get when using
//div
is as follows:
{ "_id" : ObjectId("5bbdf34ab395bb249c3c71c2"), "votes" : [ "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n " ], "answers" : [ ], "views" : [ "\r\n 3 views\r\n", "\r\n 2 views\r\n", "\r\n 4 views\r\n", "\r\n 2 views\r\n", "\r\n 2 views\r\n", "\r\n 2 views\r\n", "\r\n 3 views\r\n", "\r\n 8 views\r\n", "\r\n 3 views\r\n", "\r\n 2 views\r\n", "\r\n 2 views\r\n", "\r\n 4 views\r\n", "\r\n 5 views\r\n", "\r\n 10 views\r\n", "\r\n 5 views\r\n", "\r\n 2 views\r\n", "\r\n 2 views\r\n", "\r\n 3 views\r\n", "\r\n 2 views\r\n", "\r\n 4 views\r\n", "\r\n 14 views\r\n", "\r\n 2 views\r\n", "\r\n 5 views\r\n", "\r\n 3 views\r\n", "\r\n 5 views\r\n", "\r\n 3 views\r\n", "\r\n 6 views\r\n", "\r\n 7 
views\r\n", "\r\n 3 views\r\n", "\r\n 7 views\r\n", "\r\n 5 views\r\n", "\r\n 14 views\r\n", "\r\n 4 views\r\n", "\r\n 12 views\r\n", "\r\n 16 views\r\n", "\r\n 7 views\r\n", "\r\n 7 views\r\n", "\r\n 7 views\r\n", "\r\n 4 views\r\n", "\r\n 4 views\r\n", "\r\n 3 views\r\n", "\r\n 2 views\r\n", "\r\n 4 views\r\n", "\r\n 3 views\r\n", "\r\n 3 views\r\n", "\r\n 8 views\r\n", "\r\n 2 views\r\n", "\r\n 10 views\r\n", "\r\n 6 views\r\n", "\r\n 3 views\r\n" ] }
{ "_id" : ObjectId("5bbdf34ab395bb249c3c71c3"), "votes" : [ "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n " ], "answers" : [ ], "views" : [ "\r\n 3 views\r\n", "\r\n 2 views\r\n", "\r\n 4 views\r\n", "\r\n 2 views\r\n", "\r\n 2 views\r\n", "\r\n 2 views\r\n", "\r\n 3 views\r\n", "\r\n 8 views\r\n", "\r\n 3 views\r\n", "\r\n 2 views\r\n", "\r\n 2 views\r\n", "\r\n 4 views\r\n", "\r\n 5 views\r\n", "\r\n 10 views\r\n", "\r\n 5 views\r\n", "\r\n 2 views\r\n", "\r\n 2 views\r\n", "\r\n 3 views\r\n", "\r\n 2 views\r\n", "\r\n 4 views\r\n", "\r\n 14 views\r\n", "\r\n 2 views\r\n", "\r\n 5 views\r\n", "\r\n 3 views\r\n", "\r\n 5 views\r\n", "\r\n 3 views\r\n", "\r\n 6 views\r\n", "\r\n 7 
views\r\n", "\r\n 3 views\r\n", "\r\n 7 views\r\n", "\r\n 5 views\r\n", "\r\n 14 views\r\n", "\r\n 4 views\r\n", "\r\n 12 views\r\n", "\r\n 16 views\r\n", "\r\n 7 views\r\n", "\r\n 7 views\r\n", "\r\n 7 views\r\n", "\r\n 4 views\r\n", "\r\n 4 views\r\n", "\r\n 3 views\r\n", "\r\n 2 views\r\n", "\r\n 4 views\r\n", "\r\n 3 views\r\n", "\r\n 3 views\r\n", "\r\n 8 views\r\n", "\r\n 2 views\r\n", "\r\n 10 views\r\n", "\r\n 6 views\r\n", "\r\n 3 views\r\n" ] }
Type "it" for more
I'm only showing two documents, but it produces one for every iteration of the for loop.
To summarize: I believe I'm doing something wrong with my xpaths here. Any help would be appreciated, as I've spent many days trying to fix this without success.
I'm including my pipeline, settings and items for completeness.
The Settings:
BOT_NAME = 'stack'
SPIDER_MODULES = ['stack.spiders']
NEWSPIDER_MODULE = 'stack.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'stack (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
ITEM_PIPELINES = {'stack.pipelines.MongoDBPipeline': 300}
MONGODB_SERVER = "localhost"
MONGODB_PORT = 27017
MONGODB_DB = "testpop13"
The items:
import scrapy
from scrapy.item import Item, Field


class QuestionItem(Item):
    title = Field()
    url = Field()


class PopularityItem(Item):
    votes = Field()
    answers = Field()
    views = Field()


class ModifiedItem(Item):
    lastModified = Field()
    modName = Field()
The pipeline:
import pymongo
import logging

from scrapy.conf import settings
from scrapy.exceptions import DropItem
from scrapy import log


class StackPipeline(object):
    def process_item(self, item, spider):
        return item


class MongoDBPipeline(object):
    def __init__(self):
        connection = pymongo.MongoClient(settings['MONGODB_SERVER'], settings['MONGODB_PORT'])
        self.db = connection[settings['MONGODB_DB']]

    def process_item(self, item, spider):
        collection = self.db[type(item).__name__.lower()]
        logging.info(collection.insert(dict(item)))
        return item
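For reference, this is how each item type ends up in its own collection: the pipeline derives the collection name from the item's class name. A minimal, Scrapy-free sketch of that naming rule, using plain stand-in classes instead of real Items:

```python
# Plain stand-in classes instead of real scrapy Items, just to show
# how the pipeline picks one collection per item type.
class QuestionItem(dict):
    pass


class PopularityItem(dict):
    pass


def collection_name(item):
    # The same expression the pipeline uses: type(item).__name__.lower()
    return type(item).__name__.lower()


print(collection_name(QuestionItem()))    # questionitem
print(collection_name(PopularityItem()))  # popularityitem
```

which is why the working questions spider's data shows up under db.questionitem.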
And lastly, how the correct output from the questions spider looks:
> db.questionitem.find()
{ "_id" : ObjectId("5bbdfa29b395bb1c74c9721c"), "title" : "Why I can't enforce EditTextPreference to take just numbers?", "url" : "/questions/52741046/why-i-cant-enforce-edittextpreference-to-take-just-numbers" }
{ "_id" : ObjectId("5bbdfa2ab395bb1c74c9721d"), "title" : "mysql curdate method query is not giving correct result", "url" : "/questions/52741045/mysql-curdate-method-query-is-not-giving-correct-result" }
{ "_id" : ObjectId("5bbdfa2ab395bb1c74c9721e"), "title" : "how to execute FME workbench with parameters in java", "url" : "/questions/52741044/how-to-execute-fme-workbench-with-parameters-in-java" }
{ "_id" : ObjectId("5bbdfa2ab395bb1c74c9721f"), "title" : "create a top 10 list for multiple groups with a ranking in python", "url" : "/questions/52741043/create-a-top-10-list-for-multiple-groups-with-a-ranking-in-python" }
{ "_id" : ObjectId("5bbdfa2ab395bb1c74c97220"), "title" : "Blob binding not working in VS2017 Azure function template", "url" : "/questions/52741041/blob-binding-not-working-in-vs2017-azure-function-template" }
{ "_id" : ObjectId("5bbdfa2ab395bb1c74c97221"), "title" : "How to convert float to vector<unsigned char> in C++?", "url" : "/questions/52741039/how-to-convert-float-to-vectorunsigned-char-in-c" }
{ "_id" : ObjectId("5bbdfa2ab395bb1c74c97222"), "title" : "Nginx serving server and static build", "url" : "/questions/52741038/nginx-serving-server-and-static-build" }
{ "_id" : ObjectId("5bbdfa2ab395bb1c74c97223"), "title" : "Excel Shortout key to format axis bound?", "url" : "/questions/52741031/excel-shortout-key-to-format-axis-bound" }
{ "_id" : ObjectId("5bbdfa2ab395bb1c74c97224"), "title" : "POST successful but the data doesn't appear in the controller", "url" : "/questions/52741029/post-successful-but-the-data-doesnt-appear-in-the-controller" }
{ "_id" : ObjectId("5bbdfa2ab395bb1c74c97225"), "title" : "Node - Nested For loop async behaviour", "url" : "/questions/52741028/node-nested-for-loop-async-behaviour" }
{ "_id" : ObjectId("5bbdfa2ab395bb1c74c97226"), "title" : "KSH Shell script not zipping up files", "url" : "/questions/52741027/ksh-shell-script-not-zipping-up-files" }
{ "_id" : ObjectId("5bbdfa2ab395bb1c74c97227"), "title" : "Property 'replaceReducer' does not exist on type 'Store<State>' After upgrading @ngrx/store", "url" : "/questions/52741023/property-replacereducer-does-not-exist-on-type-storestate-after-upgrading" }
{ "_id" : ObjectId("5bbdfa2ab395bb1c74c97228"), "title" : "passing more than 10 arguments to a shell script within gitlab yaml", "url" : "/questions/52741022/passing-more-than-10-arguments-to-a-shell-script-within-gitlab-yaml" }
{ "_id" : ObjectId("5bbdfa2ab395bb1c74c97229"), "title" : "Setting an environmental variable in a docker-compose.yml file is the same as setting that variable in a .env file?", "url" : "/questions/52741021/setting-an-environmental-variable-in-a-docker-compose-yml-file-is-the-same-as-se" }
{ "_id" : ObjectId("5bbdfa2ab395bb1c74c9722a"), "title" : "Pass list of topics from application yml to KafkaListener", "url" : "/questions/52741016/pass-list-of-topics-from-application-yml-to-kafkalistener" }
{ "_id" : ObjectId("5bbdfa2ab395bb1c74c9722b"), "title" : "Copy numbers at the beggining of each line to the end of line", "url" : "/questions/52741015/copy-numbers-at-the-beggining-of-each-line-to-the-end-of-line" }
{ "_id" : ObjectId("5bbdfa2ab395bb1c74c9722c"), "title" : "Pretty JSON retrieved from response in GoLang", "url" : "/questions/52741013/pretty-json-retrieved-from-response-in-golang" }
{ "_id" : ObjectId("5bbdfa2ab395bb1c74c9722d"), "title" : "Swift: Sorting Core Data child entities based on Date in each parent", "url" : "/questions/52741010/swift-sorting-core-data-child-entities-based-on-date-in-each-parent" }
{ "_id" : ObjectId("5bbdfa2ab395bb1c74c9722e"), "title" : "How to create Paypal developer account", "url" : "/questions/52741009/how-to-create-paypal-developer-account" }
{ "_id" : ObjectId("5bbdfa2ab395bb1c74c9722f"), "title" : "output of the program and explain why a and b showing different values", "url" : "/questions/52741008/output-of-the-program-and-explain-why-a-and-b-showing-different-values" }
Type "it" for more
From this output I can save it to CSV and everything works.
I apologize for the lengthy post; I wanted to be as complete as possible. If any other information is required, please don't hesitate to ask; I'll be monitoring this question closely.
Thanks in advance for any help.
The votes, answers and views aren't inside the `<h3>` tags selected by the `questions` xpath. You'll need to pick another xpath for `popularity`. Try `popularity = Selector(response).xpath('//div[@class="statscontainer"]')`
– pwinz Oct 10 '18 at 14:15