Note: I wrote the code in Spyder and ran it from the Anaconda command prompt with scrapy crawl KMSS.
Question A:
I have an import error for my items, asked here and so far unanswered: Import Module Error (I have just added some extra details to that question).
However, the import error does not stop me from running the script at the Anaconda command prompt (if I have understood it correctly).
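For context, the project has the standard layout that scrapy startproject generates (the spider file name below is just what I called it); my guess is that scrapy crawl, run from the folder containing scrapy.cfg, puts the project root on the import path so crawlKMSS.items resolves, while Spyder runs the file directly and cannot find it:

crawlKMSS/                   # project root; I run 'scrapy crawl KMSS' from here
    scrapy.cfg
    crawlKMSS/
        __init__.py
        items.py             # defines CrawlkmssItem (shown at the bottom)
        settings.py
        spiders/
            __init__.py
            kmss_spider.py   # the spider below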
from scrapy.spiders import Rule
from scrapy.spiders.init import InitSpider  # provides init_request()/initialized() used below
from scrapy.selector import Selector
from scrapy.http import Request, FormRequest
from scrapy.linkextractors import LinkExtractor
from crawlKMSS.items import CrawlkmssItem
class KmssSpider(InitSpider):
    name = "KMSS"
    allowed_domains = ["~/LotusQuickr/dept"]
    loginp = 'https://~/LotusQuickr/dept/Main.nsf?OpenDatabase&login'
    start_urls = ('https://~/LotusQuickr/dept/Main.nsf',)

    rules = (
        Rule(LinkExtractor(), callback='parse', follow=True),
    )
    def init_request(self):
        """This function is called before crawling starts."""
        return Request(url=self.loginp, callback=self.login)

    def login(self, response):
        """Generate a login request."""
        return FormRequest.from_response(
            response,
            formdata={'name': 'username', 'password': 'pw'},
            callback=self.check_login_response)
    def check_login_response(self, response):
        """Check the response returned by a login request to see if we are
        successfully logged in.
        """
        if "what_should_I_put_here" in response.body:
            self.log("Successfully logged in. Let's start crawling!")
            # Now the crawling can begin; initialized() releases the queued requests.
            return self.initialized()
        else:
            self.log("You are not logged in.")
            # Something went wrong, we couldn't log in, so nothing happens.
    def parse(self, response):
        hxs = Selector(response)
        # Union of the three node types I care about (see Question E below).
        tabs = hxs.xpath('//div[@class="lotusBottomCorner"]/span/ul/li'
                         ' | //div[@class="q-otherItem"]/h4'
                         ' | //div[@class="q-folderItem"]/h4')
        for tab in tabs:
            kmTab = CrawlkmssItem()
            kmTab['title'] = tab.xpath(
                "a[contains(@class,'qtocsprite')]/text()").extract()
            kmTab['url'] = tab.xpath(
                "a[contains(@class,'qtocsprite')]/@href").extract()
            kmTab['fileurl'] = tab.xpath('a/@href').extract()
            kmTab['filename'] = tab.xpath('a/text()').extract()
            kmTab['folderurl'] = tab.xpath('a/@href').extract()
            kmTab['foldername'] = tab.xpath('a/text()').extract()
            yield kmTab
I have my first crawling project written above. My task is to extract information from our company's intranet (my computer is configured to access the intranet).
Question B:
Is it possible to crawl an intranet?
The intranet requires authentication for everything except the login page (loginp).
(I used '~' to hide the actual site, as it is not supposed to be published; all the (~)s are identical.)
I implemented the log-in step in the login function by referring to previous questions answered on Stack Overflow. However, for the check 'if something in response.body' in check_login_response:
Question C:
I have no idea what I should put in place of the 'something'.
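For what it is worth, this is the kind of check I imagine is needed; the marker has to be something that appears only after a successful login, and the 'Log Out' text below is purely my assumption, not anything confirmed for our Quickr site:

    def check_login_response(self, response):
        # Assumes the page shows a logout link only when authenticated;
        # replace b"Log Out" with whatever marker the real page uses.
        if b"Log Out" in response.body:
            self.log("Successfully logged in. Let's start crawling!")
            return self.initialized()
        self.log("You are not logged in.")

Another idea would be to test for the absence of the login form itself, but I have not verified either approach.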
After logging in (where I have no idea how to tell whether it has logged in or not), I should be able to go through every URL found from start_urls, and the spider should keep running through every possible URL with the LinkExtractor under the format mentioned below.
Question D:
Since the spider starts from start_urls, and all the URLs follow the format
https://~/LotusQuickr/dept/... (some with Main.nsf and some without),
I have to use allow=[''] under the Rule so that it works for URLs of the format above. Am I correct? (The same path is also what I listed under allowed_domains.)
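My current understanding, sketched below and possibly wrong: allow takes regular expressions matched against the whole URL, while allowed_domains should hold host names only (no path), so something like this, assuming the rules are actually processed (they normally require a CrawlSpider):

    allowed_domains = ["~"]  # host name only, no /LotusQuickr/dept path
    rules = (
        Rule(LinkExtractor(allow=[r'/LotusQuickr/dept/']),
             callback='parse_item', follow=True),  # a name other than 'parse'
    )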
With the selector, I need to extract three types of information (a sketch of the corresponding XPath follows this list):
1) the href and text() (if the two elements exist) of each <li> under the <div> of class lotusBottomCorner;
2) the href and text() (if the two elements exist) of each <h4> under each <td> with the class q-folderItem (if this class exists);
3) the href and text() (if the two elements exist) of each <h4> under each <td> with the class q-otherItem (if this class exists).
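Written out, I believe the three extractions look roughly like this; note that my spider code above anchored the class on a <div> (div[@class="q-folderItem"]/h4), whereas the description here puts it on the <td>, and I am not sure which is right:

    # 1) tab links in the bottom corner
    //div[@class="lotusBottomCorner"]//li/a/@href
    //div[@class="lotusBottomCorner"]//li/a/text()
    # 2) folder items
    //td[@class="q-folderItem"]//h4/a/@href
    //td[@class="q-folderItem"]//h4/a/text()
    # 3) other (file) items
    //td[@class="q-otherItem"]//h4/a/@href
    //td[@class="q-otherItem"]//h4/a/text()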
Question E:
I tested the selectors in my Chrome console to make sure each of them works on its own. However, when I combined them into one selector with |, they no longer work. How should I fix or restructure it so that I can obtain all three kinds of information on every page?
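One restructuring I have been considering, as a sketch only (it keeps the <div>-based classes from my spider code above), is to drop the | union and loop over each selector separately, so each item type fills its own fields:

    def parse(self, response):
        # tabs in the bottom-corner navigation
        for li in response.xpath('//div[@class="lotusBottomCorner"]/span/ul/li'):
            item = CrawlkmssItem()
            item['title'] = li.xpath("a[contains(@class,'qtocsprite')]/text()").extract()
            item['url'] = li.xpath("a[contains(@class,'qtocsprite')]/@href").extract()
            yield item
        # folder items
        for h4 in response.xpath('//div[@class="q-folderItem"]/h4'):
            item = CrawlkmssItem()
            item['foldername'] = h4.xpath('a/text()').extract()
            item['folderurl'] = h4.xpath('a/@href').extract()
            yield item
        # other (file) items
        for h4 in response.xpath('//div[@class="q-otherItem"]/h4'):
            item = CrawlkmssItem()
            item['filename'] = h4.xpath('a/text()').extract()
            item['fileurl'] = h4.xpath('a/@href').extract()
            yield item

I do not know whether this is the idiomatic way, but it avoids one node type overwriting fields meant for another.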
I have my items.py as below:
import scrapy


class CrawlkmssItem(scrapy.Item):
    title = scrapy.Field()
    url = scrapy.Field()
    foldername = scrapy.Field()
    folderurl = scrapy.Field()
    filename = scrapy.Field()
    fileurl = scrapy.Field()
Sorry for asking such a lengthy question. I am very new to Scrapy and have already read through several tutorials and the documentation, yet I still did not manage to implement this.
I really appreciate all the help!