Is it possible for Scrapy to crawl an alert message?

For example, when the link http://domainhere/admin is loaded in an actual browser, an alert message with a form pops up asking for a username and password.

Or is there a way to inspect the form in the alert message to find out which parameters need to be filled in?
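For reference, here is a minimal sketch (using the requests library rather than Scrapy) assuming the dialog is the browser's built-in HTTP Basic Auth prompt rather than part of the page; fetching the URL without credentials should return a 401 response whose WWW-Authenticate header names the scheme, meaning there is no form in the DOM to inspect:

import requests

# The URL is the placeholder from the question above.
response = requests.get("http://domainhere/admin")
print(response.status_code)                      # expected: 401
print(response.headers.get("WWW-Authenticate"))  # e.g. 'Basic realm="..."'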

PS: I do have credentials for this website; I just want to automate processes through web crawling.

Thanks.

BLNK
  • Here is an example https://github.com/nabinkhadka/movies-details-scraper – Nabin Nov 09 '17 at 04:27
  • @Fan_of_Martijn_Pieters thanks for this one. I checked the code and it was helpful, but the next problem is how to view the DOM of the alert message. When I follow the link while not logged in, the alert message shows up; otherwise, an error message appears in the body. – BLNK Nov 09 '17 at 05:35
  • I need to get the selector for the content in the alert message. Then I'll be able to get the names declared in the inputs as parameters for the form. – BLNK Nov 09 '17 at 05:36

1 Answer

What I did to achieve this was the following:

  1. Observed what data is needed after authentication to proceed with the page.
  2. Using Chrome's developer tools, I checked the Request Headers in the Network tab. Upon observation, an Authorization header is needed.
  3. To verify step #2, I used Postman. Choosing Basic Auth as the Authorization type and filling in the username and password generates the same value for the Authorization header. After sending a POST request, it loaded the desired page and bypassed the authentication.
  4. Store that same Authorization header value in the scraper class (a sketch of generating the value directly follows this list).
  5. Use the scrapy.Request function with the headers parameter.
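As a side note, here is a minimal sketch (not part of the original steps) of generating that header value in Python instead of copying it from Postman; Basic auth is just the Base64 encoding of username:password, and the credentials below are placeholders:

import base64

# Placeholder credentials; use the real ones for the site.
username = "user"
password = "pass"
token = base64.b64encode(f"{username}:{password}".encode()).decode()
auth = "Basic " + token
print(auth)  # e.g. "Basic dXNlcjpwYXNz" for user:pass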

Code:

import scrapy

class TestScraper(scrapy.Spider):
    # Let the 401 (Unauthorized) response reach the callback instead of
    # being filtered out by Scrapy.
    handle_httpstatus_list = [401]
    name = "Test"
    allowed_domains = ["xxx.xx.xx"]
    start_urls = ["http://testdomain/test"]

    # Authorization header value copied from the Request Headers
    # (the same value Postman generates for Basic Auth).
    auth = "Basic [Key Here]"

    def parse(self, response):
        # The initial request comes back as 401; repeat it with the
        # Authorization header attached. dont_filter=True keeps the
        # dupefilter from dropping this second request to the same URL.
        return scrapy.Request(
            "http://testdomain/test",
            headers={'Authorization': self.auth},
            callback=self.after_login,
            dont_filter=True
        )

    def after_login(self, response):
        # Authenticated response; log the page body.
        self.log(response.body)

Now you can crawl the page after the authentication process.
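For completeness, Scrapy also ships with a built-in HttpAuthMiddleware that reads credentials from spider attributes and adds the Basic auth header for you, so a sketch along these lines (same placeholder domain and credentials) should achieve the same result without handling the 401 manually:

import scrapy

class TestAuthSpider(scrapy.Spider):
    name = "TestAuth"
    allowed_domains = ["xxx.xx.xx"]
    start_urls = ["http://testdomain/test"]

    # Picked up by scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware.
    http_user = "your-username"  # placeholder credentials
    http_pass = "your-password"
    # Newer Scrapy versions also expect the target domain to be pinned:
    http_auth_domain = "xxx.xx.xx"

    def parse(self, response):
        self.log(response.body)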

BLNK