Note: I wrote the code in Spyder and ran it from the Anaconda command prompt with scrapy crawl KMSS.
Question A:
I have an import error for my items, asked here and so far unanswered: Import Module Error (I have just added some extra details to that question).
However, the import error does not stop me from running the script at the Anaconda command prompt (if I have understood it correctly).
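For context, the project has the standard layout that scrapy startproject generates (the spider file name below is just what I called it); my guess is that scrapy crawl, run from the folder containing scrapy.cfg, puts the project root on the import path so crawlKMSS.items resolves, while Spyder runs the file directly and cannot find it:

crawlKMSS/                   # project root; I run 'scrapy crawl KMSS' from here
    scrapy.cfg
    crawlKMSS/
        __init__.py
        items.py             # defines CrawlkmssItem (shown at the bottom)
        settings.py
        spiders/
            __init__.py
            kmss_spider.py   # the spider below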
from scrapy.spiders import Rule
from scrapy.spiders.init import InitSpider  # provides init_request()/initialized() used below
from scrapy.selector import Selector
from scrapy.http import Request, FormRequest
from scrapy.linkextractors import LinkExtractor
from crawlKMSS.items import CrawlkmssItem
class KmssSpider(InitSpider):
    name = "KMSS"
    allowed_domains = ["~/LotusQuickr/dept"]
    loginp = 'https://~/LotusQuickr/dept/Main.nsf?OpenDatabase&login'
    start_urls = ('https://~/LotusQuickr/dept/Main.nsf',)

    rules = (
        Rule(LinkExtractor(), callback='parse', follow=True),
    )
    def init_request(self):
        """This function is called before crawling starts."""
        return Request(url=self.loginp, callback=self.login)

    def login(self, response):
        """Generate a login request."""
        return FormRequest.from_response(
            response,
            formdata={'name': 'username', 'password': 'pw'},
            callback=self.check_login_response)
    def check_login_response(self, response):
        """Check the response returned by a login request to see if we are
        successfully logged in.
        """
        if "what_should_I_put_here" in response.body:
            self.log("Successfully logged in. Let's start crawling!")
            # Now the crawling can begin; initialized() releases the queued requests.
            return self.initialized()
        else:
            self.log("You are not logged in.")
            # Something went wrong, we couldn't log in, so nothing happens.
    def parse(self, response):
        hxs = Selector(response)
        # Union of the three node types I care about (see Question E below).
        tabs = hxs.xpath('//div[@class="lotusBottomCorner"]/span/ul/li'
                         ' | //div[@class="q-otherItem"]/h4'
                         ' | //div[@class="q-folderItem"]/h4')
        for tab in tabs:
            kmTab = CrawlkmssItem()
            kmTab['title'] = tab.xpath(
                "a[contains(@class,'qtocsprite')]/text()").extract()
            kmTab['url'] = tab.xpath(
                "a[contains(@class,'qtocsprite')]/@href").extract()
            kmTab['fileurl'] = tab.xpath('a/@href').extract()
            kmTab['filename'] = tab.xpath('a/text()').extract()
            kmTab['folderurl'] = tab.xpath('a/@href').extract()
            kmTab['foldername'] = tab.xpath('a/text()').extract()
            yield kmTab
I have my first crawling project written above. My task is to extract information from our company's intranet (my computer is configured to access the intranet).
Question B:
Is it possible to crawl an intranet?
The intranet requires authentication for everything except the login page (loginp).
(I used '~' to hide the actual site, as it is not supposed to be published; all the (~)s are identical.)
I implemented the log-in step in the login function by referring to previous questions answered on Stack Overflow. However, for the check 'if something in response.body' in check_login_response:
Question C:
I have no idea what I should put in place of the 'something'.
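For what it is worth, this is the kind of check I imagine is needed; the marker has to be something that appears only after a successful login, and the 'Log Out' text below is purely my assumption, not anything confirmed for our Quickr site:

    def check_login_response(self, response):
        # Assumes the page shows a logout link only when authenticated;
        # replace b"Log Out" with whatever marker the real page uses.
        if b"Log Out" in response.body:
            self.log("Successfully logged in. Let's start crawling!")
            return self.initialized()
        self.log("You are not logged in.")

Another idea would be to test for the absence of the login form itself, but I have not verified either approach.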
After logging in (where I have no idea how to tell whether it has logged in or not), I should be able to go through every URL found from start_urls, and the spider should keep running through every possible URL with the LinkExtractor under the format mentioned below.
Question D:
Since the spider starts from start_urls, and all the URLs follow the format
https://~/LotusQuickr/dept/... (some with Main.nsf and some without),
I have to use allow=[''] under the Rule so that it works for URLs of the format above. Am I correct? (The same path is also what I listed under allowed_domains.)
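My current understanding, sketched below and possibly wrong: allow takes regular expressions matched against the whole URL, while allowed_domains should hold host names only (no path), so something like this, assuming the rules are actually processed (they normally require a CrawlSpider):

    allowed_domains = ["~"]  # host name only, no /LotusQuickr/dept path
    rules = (
        Rule(LinkExtractor(allow=[r'/LotusQuickr/dept/']),
             callback='parse_item', follow=True),  # a name other than 'parse'
    )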
With the selector, I need to extract three types of information (a sketch of the corresponding XPath follows this list):
1) the href and text() (if the two elements exist) of each <li> under the <div> of class lotusBottomCorner;
2) the href and text() (if the two elements exist) of each <h4> under each <td> with the class q-folderItem (if this class exists);
3) the href and text() (if the two elements exist) of each <h4> under each <td> with the class q-otherItem (if this class exists).
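Written out, I believe the three extractions look roughly like this; note that my spider code above anchored the class on a <div> (div[@class="q-folderItem"]/h4), whereas the description here puts it on the <td>, and I am not sure which is right:

    # 1) tab links in the bottom corner
    //div[@class="lotusBottomCorner"]//li/a/@href
    //div[@class="lotusBottomCorner"]//li/a/text()
    # 2) folder items
    //td[@class="q-folderItem"]//h4/a/@href
    //td[@class="q-folderItem"]//h4/a/text()
    # 3) other (file) items
    //td[@class="q-otherItem"]//h4/a/@href
    //td[@class="q-otherItem"]//h4/a/text()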
Question E:
I tested the selectors in my Chrome console to make sure each of them works on its own. However, when I combined them into one selector with |, they no longer work. How should I fix or restructure it so that I can obtain all three kinds of information on every page?
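One restructuring I have been considering, as a sketch only (it keeps the <div>-based classes from my spider code above), is to drop the | union and loop over each selector separately, so each item type fills its own fields:

    def parse(self, response):
        # tabs in the bottom-corner navigation
        for li in response.xpath('//div[@class="lotusBottomCorner"]/span/ul/li'):
            item = CrawlkmssItem()
            item['title'] = li.xpath("a[contains(@class,'qtocsprite')]/text()").extract()
            item['url'] = li.xpath("a[contains(@class,'qtocsprite')]/@href").extract()
            yield item
        # folder items
        for h4 in response.xpath('//div[@class="q-folderItem"]/h4'):
            item = CrawlkmssItem()
            item['foldername'] = h4.xpath('a/text()').extract()
            item['folderurl'] = h4.xpath('a/@href').extract()
            yield item
        # other (file) items
        for h4 in response.xpath('//div[@class="q-otherItem"]/h4'):
            item = CrawlkmssItem()
            item['filename'] = h4.xpath('a/text()').extract()
            item['fileurl'] = h4.xpath('a/@href').extract()
            yield item

I do not know whether this is the idiomatic way, but it avoids one node type overwriting fields meant for another.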
I have my items.py as below:
import scrapy


class CrawlkmssItem(scrapy.Item):
    title = scrapy.Field()
    url = scrapy.Field()
    foldername = scrapy.Field()
    folderurl = scrapy.Field()
    filename = scrapy.Field()
    fileurl = scrapy.Field()
Sorry for asking such a lengthy question. I am very new to Scrapy and have already read through several tutorials and the documentation, yet I still did not manage to implement this.
I really appreciate all the help!