I have been working with Python and Scrapy for the past week, using the following tutorial: https://realpython.com/web-scraping-with-scrapy-and-mongodb/
The tutorial walks through scraping the top questions and their URLs from Stack Overflow with the Scrapy web crawler, then storing them in a MongoDB database and collection.
I'm trying to adapt what the tutorial does to scrape and store multiple items into multiple collections in the same MongoDB database, and then export everything in CSV format. I've figured out most of it: the pipeline to MongoDB, storing multiple collections, and changing the collection name based on the name of the item being scraped. What I cannot get working are the spiders, or more specifically the xpaths Scrapy uses to find the specified items on the page; to my understanding, the problem lies with the xpaths being wrong.
I have no prior experience with Scrapy, and I've spent days of research trying to figure out the xpaths, but I can't seem to get them working.
The page I'm trying to scrape: https://stackoverflow.com/
The spider for question titles and URLs, which works as intended:
from scrapy import Spider
from scrapy.selector import Selector

from stack.items import QuestionItem


class QuestionSpider(Spider):
    name = "questions"
    allowed_domains = ["stackoverflow.com"]
    start_urls = [
        "http://stackoverflow.com/questions?pagesize=50&sort=newest",
    ]

    def parse(self, response):
        questions = Selector(response).xpath('//div[@class="summary"]/h3')

        for question in questions:
            item = QuestionItem()
            item['title'] = question.xpath(
                'a[@class="question-hyperlink"]/text()').extract()[0]
            item['url'] = question.xpath(
                'a[@class="question-hyperlink"]/@href').extract()[0]
            yield item
The spider for the number of answers, votes and views, which is not working as intended:
from scrapy import Spider
from scrapy.selector import Selector

from stack.items import PopularityItem


class PopularitySpider(Spider):
    name = "popularity"
    allowed_domains = ["stackoverflow.com"]
    start_urls = [
        "http://stackoverflow.com/questions?pagesize=50&sort=newest",
    ]

    def parse(self, response):
        popularity = Selector(response).xpath('//div[@class="summary"]/h3')

        for poppart in popularity:
            item = PopularityItem()
            item['votes'] = poppart.xpath(
                'div[contains(@class, "votes")]/text()').extract()
            item['answers'] = poppart.xpath(
                'div[contains(@class, "answers")]/text()').extract()
            item['views'] = poppart.xpath(
                'div[contains(@class, "views")]/text()').extract()
            yield item
And lastly there is the third spider, which has similar problems to the second.
With the second spider, I get the following output and data stored in my MongoDB database after starting it with:
scrapy crawl popularity
{ "_id" : ObjectId("5bbde11cb395bb1dc0d9410d"), "votes" : [ ], "answers" : [ ], "views" : [ ] }
{ "_id" : ObjectId("5bbde11cb395bb1dc0d9410e"), "votes" : [ ], "answers" : [ ], "views" : [ ] }
{ "_id" : ObjectId("5bbde11cb395bb1dc0d9410f"), "votes" : [ ], "answers" : [ ], "views" : [ ] }
{ "_id" : ObjectId("5bbde11cb395bb1dc0d94110"), "votes" : [ ], "answers" : [ ], "views" : [ ] }
{ "_id" : ObjectId("5bbde11cb395bb1dc0d94111"), "votes" : [ ], "answers" : [ ], "views" : [ ] }
{ "_id" : ObjectId("5bbde11cb395bb1dc0d94112"), "votes" : [ ], "answers" : [ ], "views" : [ ] }
{ "_id" : ObjectId("5bbde11cb395bb1dc0d94113"), "votes" : [ ], "answers" : [ ], "views" : [ ] }
{ "_id" : ObjectId("5bbde11cb395bb1dc0d94114"), "votes" : [ ], "answers" : [ ], "views" : [ ] }
{ "_id" : ObjectId("5bbde11cb395bb1dc0d94115"), "votes" : [ ], "answers" : [ ], "views" : [ ] }
{ "_id" : ObjectId("5bbde11cb395bb1dc0d94116"), "votes" : [ ], "answers" : [ ], "views" : [ ] }
{ "_id" : ObjectId("5bbde11cb395bb1dc0d94117"), "votes" : [ ], "answers" : [ ], "views" : [ ] }
{ "_id" : ObjectId("5bbde11cb395bb1dc0d94118"), "votes" : [ ], "answers" : [ ], "views" : [ ] }
{ "_id" : ObjectId("5bbde11cb395bb1dc0d94119"), "votes" : [ ], "answers" : [ ], "views" : [ ] }
{ "_id" : ObjectId("5bbde11cb395bb1dc0d9411a"), "votes" : [ ], "answers" : [ ], "views" : [ ] }
{ "_id" : ObjectId("5bbde11cb395bb1dc0d9411b"), "votes" : [ ], "answers" : [ ], "views" : [ ] }
{ "_id" : ObjectId("5bbde11cb395bb1dc0d9411c"), "votes" : [ ], "answers" : [ ], "views" : [ ] }
{ "_id" : ObjectId("5bbde11cb395bb1dc0d9411d"), "votes" : [ ], "answers" : [ ], "views" : [ ] }
{ "_id" : ObjectId("5bbde11cb395bb1dc0d9411e"), "votes" : [ ], "answers" : [ ], "views" : [ ] }
{ "_id" : ObjectId("5bbde11cb395bb1dc0d9411f"), "votes" : [ ], "answers" : [ ], "views" : [ ] }
{ "_id" : ObjectId("5bbde11cb395bb1dc0d94120"), "votes" : [ ], "answers" : [ ], "views" : [ ] }
As you can see, all the items are empty. The only way I have been able to get some output was with the xpath:
//div[contains(@class, "views")]/text()
To my understanding, starting the path with "//" selects every matching div with class "views" anywhere in the document, not just those under the current node.
Using this method only works partially: I only get output for the views item, and all of it is stored in one item row; then, on the next iteration of the loop, all the output is stored again in the next item row. This makes sense, because I'm using
//div instead of div
This happens (or I think it does) because of the loop, which iterates over the number of "summary" classes on the page as a way of telling the scraper how many rows to scrape and store. That is done with the following xpath and code segment (already shown above, but repeated for clarity):
def parse(self, response):
    popularity = Selector(response).xpath('//div[@class="summary"]/h3')

    for poppart in popularity:
The output I get when using
//div
is as follows:
{ "_id" : ObjectId("5bbdf34ab395bb249c3c71c2"), "votes" : [ "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n " ], "answers" : [ ], "views" : [ "\r\n 3 views\r\n", "\r\n 2 views\r\n", "\r\n 4 views\r\n", "\r\n 2 views\r\n", "\r\n 2 views\r\n", "\r\n 2 views\r\n", "\r\n 3 views\r\n", "\r\n 8 views\r\n", "\r\n 3 views\r\n", "\r\n 2 views\r\n", "\r\n 2 views\r\n", "\r\n 4 views\r\n", "\r\n 5 views\r\n", "\r\n 10 views\r\n", "\r\n 5 views\r\n", "\r\n 2 views\r\n", "\r\n 2 views\r\n", "\r\n 3 views\r\n", "\r\n 2 views\r\n", "\r\n 4 views\r\n", "\r\n 14 views\r\n", "\r\n 2 views\r\n", "\r\n 5 views\r\n", "\r\n 3 views\r\n", "\r\n 5 views\r\n", "\r\n 3 views\r\n", "\r\n 6 views\r\n", "\r\n 7 
views\r\n", "\r\n 3 views\r\n", "\r\n 7 views\r\n", "\r\n 5 views\r\n", "\r\n 14 views\r\n", "\r\n 4 views\r\n", "\r\n 12 views\r\n", "\r\n 16 views\r\n", "\r\n 7 views\r\n", "\r\n 7 views\r\n", "\r\n 7 views\r\n", "\r\n 4 views\r\n", "\r\n 4 views\r\n", "\r\n 3 views\r\n", "\r\n 2 views\r\n", "\r\n 4 views\r\n", "\r\n 3 views\r\n", "\r\n 3 views\r\n", "\r\n 8 views\r\n", "\r\n 2 views\r\n", "\r\n 10 views\r\n", "\r\n 6 views\r\n", "\r\n 3 views\r\n" ] }
{ "_id" : ObjectId("5bbdf34ab395bb249c3c71c3"), "votes" : [ "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n ", "\r\n " ], "answers" : [ ], "views" : [ "\r\n 3 views\r\n", "\r\n 2 views\r\n", "\r\n 4 views\r\n", "\r\n 2 views\r\n", "\r\n 2 views\r\n", "\r\n 2 views\r\n", "\r\n 3 views\r\n", "\r\n 8 views\r\n", "\r\n 3 views\r\n", "\r\n 2 views\r\n", "\r\n 2 views\r\n", "\r\n 4 views\r\n", "\r\n 5 views\r\n", "\r\n 10 views\r\n", "\r\n 5 views\r\n", "\r\n 2 views\r\n", "\r\n 2 views\r\n", "\r\n 3 views\r\n", "\r\n 2 views\r\n", "\r\n 4 views\r\n", "\r\n 14 views\r\n", "\r\n 2 views\r\n", "\r\n 5 views\r\n", "\r\n 3 views\r\n", "\r\n 5 views\r\n", "\r\n 3 views\r\n", "\r\n 6 views\r\n", "\r\n 7 
views\r\n", "\r\n 3 views\r\n", "\r\n 7 views\r\n", "\r\n 5 views\r\n", "\r\n 14 views\r\n", "\r\n 4 views\r\n", "\r\n 12 views\r\n", "\r\n 16 views\r\n", "\r\n 7 views\r\n", "\r\n 7 views\r\n", "\r\n 7 views\r\n", "\r\n 4 views\r\n", "\r\n 4 views\r\n", "\r\n 3 views\r\n", "\r\n 2 views\r\n", "\r\n 4 views\r\n", "\r\n 3 views\r\n", "\r\n 3 views\r\n", "\r\n 8 views\r\n", "\r\n 2 views\r\n", "\r\n 10 views\r\n", "\r\n 6 views\r\n", "\r\n 3 views\r\n" ] }
Type "it" for more
I'm only showing two documents, but it produces one for every iteration of the for loop.
To summarize: I believe I'm doing something wrong with my xpaths here. Any help would be appreciated, as I've spent many days trying to fix this without success.
I'm including my pipeline, settings and items for completeness.
The Settings:
BOT_NAME = 'stack'
SPIDER_MODULES = ['stack.spiders']
NEWSPIDER_MODULE = 'stack.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'stack (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
ITEM_PIPELINES = {'stack.pipelines.MongoDBPipeline': 300}
MONGODB_SERVER = "localhost"
MONGODB_PORT = 27017
MONGODB_DB = "testpop13"
The items:
import scrapy
from scrapy.item import Item, Field


class QuestionItem(Item):
    title = Field()
    url = Field()


class PopularityItem(Item):
    votes = Field()
    answers = Field()
    views = Field()


class ModifiedItem(Item):
    lastModified = Field()
    modName = Field()
The pipeline:
import pymongo
import logging

from scrapy.conf import settings
from scrapy.exceptions import DropItem
from scrapy import log


class StackPipeline(object):
    def process_item(self, item, spider):
        return item


class MongoDBPipeline(object):
    def __init__(self):
        connection = pymongo.MongoClient(settings['MONGODB_SERVER'], settings['MONGODB_PORT'])
        self.db = connection[settings['MONGODB_DB']]

    def process_item(self, item, spider):
        collection = self.db[type(item).__name__.lower()]
        logging.info(collection.insert(dict(item)))
        return item
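For reference, this is how each item type ends up in its own collection: the pipeline derives the collection name from the item's class name. A minimal, Scrapy-free sketch of that naming rule, using plain stand-in classes instead of real Items:

```python
# Plain stand-in classes instead of real scrapy Items, just to show
# how the pipeline picks one collection per item type.
class QuestionItem(dict):
    pass


class PopularityItem(dict):
    pass


def collection_name(item):
    # The same expression the pipeline uses: type(item).__name__.lower()
    return type(item).__name__.lower()


print(collection_name(QuestionItem()))    # questionitem
print(collection_name(PopularityItem()))  # popularityitem
```

which is why the working questions spider's data shows up under db.questionitem.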
And lastly, how the correct output from the questions spider looks:
> db.questionitem.find()
{ "_id" : ObjectId("5bbdfa29b395bb1c74c9721c"), "title" : "Why I can't enforce EditTextPreference to take just numbers?", "url" : "/questions/52741046/why-i-cant-enforce-edittextpreference-to-take-just-numbers" }
{ "_id" : ObjectId("5bbdfa2ab395bb1c74c9721d"), "title" : "mysql curdate method query is not giving correct result", "url" : "/questions/52741045/mysql-curdate-method-query-is-not-giving-correct-result" }
{ "_id" : ObjectId("5bbdfa2ab395bb1c74c9721e"), "title" : "how to execute FME workbench with parameters in java", "url" : "/questions/52741044/how-to-execute-fme-workbench-with-parameters-in-java" }
{ "_id" : ObjectId("5bbdfa2ab395bb1c74c9721f"), "title" : "create a top 10 list for multiple groups with a ranking in python", "url" : "/questions/52741043/create-a-top-10-list-for-multiple-groups-with-a-ranking-in-python" }
{ "_id" : ObjectId("5bbdfa2ab395bb1c74c97220"), "title" : "Blob binding not working in VS2017 Azure function template", "url" : "/questions/52741041/blob-binding-not-working-in-vs2017-azure-function-template" }
{ "_id" : ObjectId("5bbdfa2ab395bb1c74c97221"), "title" : "How to convert float to vector<unsigned char> in C++?", "url" : "/questions/52741039/how-to-convert-float-to-vectorunsigned-char-in-c" }
{ "_id" : ObjectId("5bbdfa2ab395bb1c74c97222"), "title" : "Nginx serving server and static build", "url" : "/questions/52741038/nginx-serving-server-and-static-build" }
{ "_id" : ObjectId("5bbdfa2ab395bb1c74c97223"), "title" : "Excel Shortout key to format axis bound?", "url" : "/questions/52741031/excel-shortout-key-to-format-axis-bound" }
{ "_id" : ObjectId("5bbdfa2ab395bb1c74c97224"), "title" : "POST successful but the data doesn't appear in the controller", "url" : "/questions/52741029/post-successful-but-the-data-doesnt-appear-in-the-controller" }
{ "_id" : ObjectId("5bbdfa2ab395bb1c74c97225"), "title" : "Node - Nested For loop async behaviour", "url" : "/questions/52741028/node-nested-for-loop-async-behaviour" }
{ "_id" : ObjectId("5bbdfa2ab395bb1c74c97226"), "title" : "KSH Shell script not zipping up files", "url" : "/questions/52741027/ksh-shell-script-not-zipping-up-files" }
{ "_id" : ObjectId("5bbdfa2ab395bb1c74c97227"), "title" : "Property 'replaceReducer' does not exist on type 'Store<State>' After upgrading @ngrx/store", "url" : "/questions/52741023/property-replacereducer-does-not-exist-on-type-storestate-after-upgrading" }
{ "_id" : ObjectId("5bbdfa2ab395bb1c74c97228"), "title" : "passing more than 10 arguments to a shell script within gitlab yaml", "url" : "/questions/52741022/passing-more-than-10-arguments-to-a-shell-script-within-gitlab-yaml" }
{ "_id" : ObjectId("5bbdfa2ab395bb1c74c97229"), "title" : "Setting an environmental variable in a docker-compose.yml file is the same as setting that variable in a .env file?", "url" : "/questions/52741021/setting-an-environmental-variable-in-a-docker-compose-yml-file-is-the-same-as-se" }
{ "_id" : ObjectId("5bbdfa2ab395bb1c74c9722a"), "title" : "Pass list of topics from application yml to KafkaListener", "url" : "/questions/52741016/pass-list-of-topics-from-application-yml-to-kafkalistener" }
{ "_id" : ObjectId("5bbdfa2ab395bb1c74c9722b"), "title" : "Copy numbers at the beggining of each line to the end of line", "url" : "/questions/52741015/copy-numbers-at-the-beggining-of-each-line-to-the-end-of-line" }
{ "_id" : ObjectId("5bbdfa2ab395bb1c74c9722c"), "title" : "Pretty JSON retrieved from response in GoLang", "url" : "/questions/52741013/pretty-json-retrieved-from-response-in-golang" }
{ "_id" : ObjectId("5bbdfa2ab395bb1c74c9722d"), "title" : "Swift: Sorting Core Data child entities based on Date in each parent", "url" : "/questions/52741010/swift-sorting-core-data-child-entities-based-on-date-in-each-parent" }
{ "_id" : ObjectId("5bbdfa2ab395bb1c74c9722e"), "title" : "How to create Paypal developer account", "url" : "/questions/52741009/how-to-create-paypal-developer-account" }
{ "_id" : ObjectId("5bbdfa2ab395bb1c74c9722f"), "title" : "output of the program and explain why a and b showing different values", "url" : "/questions/52741008/output-of-the-program-and-explain-why-a-and-b-showing-different-values" }
Type "it" for more
From this output I can save it to CSV and everything works.
I apologize for the lengthy post; I wanted to be as complete as possible. If any other information is required, please don't hesitate to ask; I'll be monitoring this question closely.
Thanks in advance for any help.
The votes, answers and views aren't inside the `<h3>` tags selected by the `questions` xpath. You'll need to pick another xpath for `popularity`. Try `popularity = Selector(response).xpath('//div[@class="statscontainer"]')`
– pwinz Oct 10 '18 at 14:15