Here is what I did to achieve this:
- Observed what data the page needs after authentication in order to proceed.
- Using the Network tab in Chrome's Developer Tools, I checked the Request Headers and saw that an Authorization header is required.
- To verify the previous step, I used Postman. In Postman's Authorization tab, selecting the Basic Auth type and filling in the username and password generates the same value for the Authorization header. After sending a POST request, the desired page loaded and the authentication was bypassed.
- Since the value matches the Authorization under Request Headers, store that value in the Scraper class.
- Use the scrapy.Request function with the headers parameter.
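As a side note on the Postman step: the value that Basic Auth produces is just "username:password" encoded in base64 and prefixed with "Basic ". Below is a minimal sketch of that encoding using only the standard library. The helper name `basic_auth_header` and the credentials are made up for illustration and are not part of my original setup.

```python
import base64


def basic_auth_header(username: str, password: str) -> str:
    """Build the value of an HTTP Basic Authorization header."""
    # Basic auth joins the credentials with a colon, then base64-encodes them.
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return f"Basic {token}"


# Hypothetical credentials, for illustration only.
print(basic_auth_header("user", "pass"))  # Basic dXNlcjpwYXNz
```

If the printed value matches what you see under Request Headers in Chrome, you can generate the header in code instead of copying it by hand.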
Code:
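import scrapy


class TestScraper(scrapy.Spider):
    # Let 401 responses reach parse() instead of being dropped
    handle_httpstatus_list = [401]
    name = "Test"
    allowed_domains = ["xxx.xx.xx"]
    start_urls = ["http://testdomain/test"]
    # Authorization header value copied from the browser / Postman
    auth = "Basic [Key Here]"

    def parse(self, response):
        # Re-request the page, this time with the Authorization header
        return scrapy.Request(
            "http://testdomain/test",
            headers={"Authorization": self.auth},
            callback=self.after_login,
        )

    def after_login(self, response):
        # The authenticated page; log its body to confirm access
        self.log(response.body)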
Now you can crawl the page that sits behind the authentication.