
I've been running my Scrapy project with a couple of accounts (the project scrapes a specific site that requires login credentials), but no matter the parameters I set, it always runs with the same ones (same credentials).

I'm running under virtualenv. Is there a variable or setting I'm missing?

Edit:

It seems that this problem is Twisted related.

Even when I run:

scrapy crawl -a user='user' -a password='pass' -o items.json -t json SpiderName

I still get an error saying:

ERROR: twisted.internet.error.ReactorNotRestartable

And all the information I get is from the last 'successful' run of the spider.

Jean Ventura

2 Answers


Check your spider's __init__ method: it should accept the username and password arguments if it doesn't already. Like this:

from scrapy.spider import BaseSpider
from scrapy.http import FormRequest

class MySpider(BaseSpider):
    name = 'myspider'

    def __init__(self, username=None, password=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.start_urls = ['http://www.example.com/']
        self.username = username
        self.password = password

    def start_requests(self):
        # Log in first; the rest of the crawl starts from the callback
        return [FormRequest("http://www.example.com/login",
                            formdata={'user': self.username,
                                      'pass': self.password},
                            callback=self.logged_in)]

    def logged_in(self, response):
        # here you would extract links to follow and return Requests for
        # each of them, with another callback
        pass

Run it:

scrapy crawl myspider -a username=yourname -a password=yourpass

Code adapted from: http://doc.scrapy.org/en/0.18/topics/spiders.html

EDIT: You can have only one Twisted reactor, but you can run multiple spiders in the same process with different credentials. Example of running multiple spiders: http://doc.scrapy.org/en/0.18/topics/practices.html#running-multiple-spiders-in-the-same-process
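For reference, the linked docs example boils down to something like this. This is only a sketch: it assumes the MySpider class above and a configured Scrapy 0.18 project, so it won't run standalone. The key point is that there is a single reactor.run() call no matter how many crawlers you start, and the docs note you must stop the reactor yourself once the last spider closes:

```python
# Sketch of the Scrapy 0.18 "multiple spiders in one process" pattern,
# adapted to pass different credentials to each spider (assumes MySpider
# from the answer above; not a drop-in script).
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import log

def setup_crawler(username, password):
    spider = MySpider(username=username, password=password)
    crawler = Crawler(Settings())
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()

# One reactor, several crawlers -- never call reactor.run() twice.
for user, pwd in [('user1', 'pass1'), ('user2', 'pass2')]:
    setup_crawler(user, pwd)
log.start()
reactor.run()  # stop the reactor manually when the last spider closes
```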

nickzam
  • Tried what you suggested and nothing. This is what I had before: def __init__(self, *args, **kwargs): self.user = kwargs.pop('user', None) self.password = kwargs.pop('password', None) – Jean Ventura Oct 30 '13 at 19:12
  • Error "Reactor Not Restartable" is described here http://stackoverflow.com/questions/7993680/running-scrapy-tasks-in-python – nickzam Oct 30 '13 at 22:40
  • I think it has something to do with that, but not exactly. I'm only calling the reactor once in my script, but after a couple of runs, the reactor seems to stay open, and all further request go through it. Can't find a way to avoid this yet. I might have to try going twisted-less. – Jean Ventura Oct 31 '13 at 02:07
  • About your edit, that's the same script I'm using that presented the reactor error in the first place. – Jean Ventura Oct 31 '13 at 03:22
  • Do you pass username and password to the Spider instances? – nickzam Oct 31 '13 at 11:40
  • Yes I do. I think I'll opt for the CrawlerProcess route, as soon as I figure out how to make it work. – Jean Ventura Oct 31 '13 at 12:49
  • Found the problem, added the answer. Will accept as soon as SO allows it. – Jean Ventura Nov 01 '13 at 13:27

Found the problem. My project tree was 'dirty'.

Another developer had renamed the file containing the spider code, and when I pulled those changes into my local repo, the update deleted only the .py version of the old file and left its .pyc behind (because of .hgignore). This made Scrapy find the same spider module twice (the same spider lived in two different files) and call both under the same Twisted reactor.
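This failure mode is easy to reproduce outside Scrapy. The snippet below is a hypothetical illustration (the module names are invented): Python happily imports both the renamed .py file and the stale .pyc left behind under the old name, which is exactly how a package scan can discover the "same" spider twice.

```python
# Hypothetical reproduction (invented module names): a rename that leaves
# the old .pyc behind lets Python import the same module under two names.
import importlib
import os
import py_compile
import sys
import tempfile

workdir = tempfile.mkdtemp()
sys.path.insert(0, workdir)

# The original spider file, before the rename.
with open(os.path.join(workdir, 'old_spider.py'), 'w') as f:
    f.write("NAME = 'myspider'\n")

# Compile it next to the source -- the legacy .pyc layout that the
# ignored bytecode file ends up in.
py_compile.compile(os.path.join(workdir, 'old_spider.py'),
                   cfile=os.path.join(workdir, 'old_spider.pyc'))

# Simulate the rename: the .py disappears, the ignored .pyc survives.
os.remove(os.path.join(workdir, 'old_spider.py'))
with open(os.path.join(workdir, 'new_spider.py'), 'w') as f:
    f.write("NAME = 'myspider'\n")

importlib.invalidate_caches()
stale = importlib.import_module('old_spider')   # loaded from the stale .pyc
fresh = importlib.import_module('new_spider')   # loaded from the new .py
# Both import cleanly, so a scan of the package now finds the spider twice.
```

Deleting the stray .pyc (or ignoring bytecode files less aggressively than the repo did) removes the duplicate.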

After deleting the offending file everything is back to normal.

Jean Ventura