
I want to use scrapy shell to test response data for a URL that requires basic auth credentials. I checked the scrapy shell documentation but couldn't find anything about this there.

I tried `scrapy shell 'http://user:pwd@abc.com'` but it didn't work. Does anybody know how I can achieve this?

Rohanil
  • could you share how you are logging in inside a spider? – eLRuLL Mar 16 '17 at 02:33
  • I am using [HttpAuthMiddleware](https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware) in spider but I want to use shell instead of spider. – Rohanil Mar 16 '17 at 02:42
  • it will work so long as you run the shell command from your project directory. Also, with the middleware you don't need the `user:password` in the URL; the middleware handles that for you – Verbal_Kint Mar 16 '17 at 02:47

2 Answers


If you want to use only the shell, you can do something like this:

$ scrapy shell

and inside the shell:

>>> from w3lib.http import basic_auth_header
>>> from scrapy import Request
>>> auth = basic_auth_header('your_user', 'your_password')  # replace with your real credentials
>>> req = Request(url="http://example.com", headers={'Authorization': auth})
>>> fetch(req)

since `fetch` uses the given request to update the shell session.
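
Once `fetch(req)` returns, the shell rebinds `response`, so you can check right away whether the credentials were accepted (a quick follow-up in the same session):

>>> response.status      # 200 if the auth succeeded, 401 Unauthorized otherwise
>>> response.text[:200]  # peek at the start of the authenticated page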

eLRuLL
  • to be honest I would say your idea to add `user:pass` into the URL directly in the shell looks interesting, I'll try to propose or implement it in `scrapy` – eLRuLL Mar 16 '17 at 03:01
  • looks like it will be addressed very soon: https://github.com/scrapy/scrapy/pull/1466 – eLRuLL Mar 16 '17 at 03:08
  • you can probably even take this a step further and pass the auth header from this example in through the settings.py headers section, so that you do not have to enter it manually for every shell session (see the sketch below) – Alex Sep 19 '22 at 17:27
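
A minimal sketch of the settings.py idea from the last comment, assuming the shell is run from the project directory so the project settings get picked up; `DEFAULT_REQUEST_HEADERS` is Scrapy's setting for headers merged into every request, and the credentials here are placeholders:

# settings.py
from w3lib.http import basic_auth_header

# merged into every request, including those made from scrapy shell
DEFAULT_REQUEST_HEADERS = {
    'Authorization': basic_auth_header('your_user', 'your_password'),
}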

Yes, you can, with the HttpAuthMiddleware.

Make sure HttpAuthMiddleware is enabled in the settings, then just define:

from scrapy.spiders import CrawlSpider

class MySpider(CrawlSpider):
    http_user = 'username'
    http_pass = 'password'
    ...

as class variables in your spider.

Also, you don't need to specify the login credentials in the URL if the middleware has been enabled in the settings.
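
For reference, HttpAuthMiddleware ships with Scrapy and is enabled by default; a sketch of listing it explicitly in settings.py, in case your project overrides the default downloader middlewares (note that newer Scrapy versions, 2.5.1 and later, also expect an `http_auth_domain` class attribute alongside `http_user`/`http_pass`):

# settings.py
DOWNLOADER_MIDDLEWARES = {
    # 300 is the priority Scrapy's defaults assign to this middleware
    'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,
}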

Verbal_Kint
  • I want to use the shell instead of a spider – Rohanil Mar 16 '17 at 02:50
  • the shell uses the project resources – Verbal_Kint Mar 16 '17 at 02:51
  • @Rohanil try `scrapy shell 'http://www.example.org'` and make sure you have included the middleware in your settings, along with specifying the login credentials as class variables named as they are in my example – Verbal_Kint Mar 16 '17 at 02:53
  • I think it would need the domain to be included in `allowed_domains` for the shell to use this Spider implementation – eLRuLL Mar 16 '17 at 13:45
  • @eLRuLL my experience with scrapy shell has been that all the settings are loaded from the project root you are running the shell from, so all the middlewares, etc. will be used – Verbal_Kint Mar 16 '17 at 19:55
  • yeah but how do you match the url in the shell with which spider to use (assuming you have multiple spiders in your project)? the `allowed_domains` helps with that – eLRuLL Mar 16 '17 at 19:57
  • @eLRuLL My understanding is that when using the fetch command with a URL or request, no spider is specifically used, but the request does get processed by the middlewares and other features enabled in the settings. So you would have to specifically instantiate a spider in the shell if you wanted any specific functionality from that spider. – Verbal_Kint Mar 16 '17 at 20:04
  • @eLRuLL if he was just exploring responses from that address in the shell then your solution is the most convenient because you don't have to be bothered with the spider. However, if you're testing/developing a specific spider, then instantiating it under the same conditions of your intended use case would be the way to go. – Verbal_Kint Mar 16 '17 at 20:06
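
Worth noting on the spider-matching question above: `scrapy shell` accepts a `--spider` option that bypasses spider autodetection, so you can force the shell to use a particular spider and pick up its `http_user`/`http_pass` class variables; `myspider` here is a placeholder for your spider's `name`:

$ scrapy shell --spider=myspider 'http://www.example.org'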