I am writing a crawler. Once the crawler logs into a website, I want it to "stay always logged in". How can I do that? Can a client (a browser, a crawler, etc.) make a server obey this rule? This scenario could occur when the server allows only a limited number of logins per day.
1 Answer
"Logged-in state" is usually represented by cookies. So what your have to do is to store the cookie information sent by that server on login, then send that cookie with each of your subsequent requests (as noted by Aiden Bell in his message, thx).
See also this question:
How to "keep-alive" with cookielib and httplib in python?
A more comprehensive article on how to implement it:
http://www.voidspace.org.uk/python/articles/cookielib.shtml
The simplest examples are at the bottom of this manual page:
https://docs.python.org/library/cookielib.html
You can also use a regular browser (like Firefox) to log in manually. Then you'll be able to save the cookie from that browser and use it in your crawler. But such cookies are usually valid only for a limited time, so it is not a long-term, fully automated solution. It can be quite handy for downloading content from a website once, however.
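One way to do that (just a sketch, assuming you export the browser's cookies to a Netscape-format cookies.txt file, e.g. with an add-on) is to load the exported file with MozillaCookieJar; the file name and URL below are placeholders:

```python
# Reuse cookies exported from a browser (Netscape/Mozilla cookies.txt format).
import urllib.request
import http.cookiejar

cookie_jar = http.cookiejar.MozillaCookieJar("firefox_cookies.txt")
cookie_jar.load(ignore_discard=True, ignore_expires=True)

opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(cookie_jar)
)
response = opener.open("https://example.com/members-only")
print(response.status)
```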
UPDATE:
I've just found another interesting tool, Scrapy, in a recent question. It can also do this kind of cookie-based login:
http://doc.scrapy.org/topics/request-response.html#topics-request-response-ref-request-userlogin
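A rough sketch of what such a login spider might look like, along the lines of that docs section but written against the current Scrapy API; the spider name, URLs, and form field names are placeholders:

```python
import scrapy


class LoginSpider(scrapy.Spider):
    name = "login_example"
    start_urls = ["https://example.com/login"]

    def parse(self, response):
        # Submit the site's login form; Scrapy keeps the session cookie afterwards.
        return scrapy.FormRequest.from_response(
            response,
            formdata={"username": "me", "password": "secret"},
            callback=self.after_login,
        )

    def after_login(self, response):
        if b"authentication failed" in response.body:
            self.logger.error("Login failed")
            return
        # Continue crawling with the logged-in session.
        yield scrapy.Request(
            "https://example.com/protected", callback=self.parse_page
        )

    def parse_page(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}
```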
The question I mentioned is here:
Hope this helps.
- Also, he might have to add sporadic activity to the session to stop it expiring. – Aiden Bell Nov 26 '09 at 15:26
- The session can expire due to a server-side "limit" on session lifetime, even if you add sporadic activity. So the long-term solution is to allow the crawler to log in if needed. But using a cookie saved from a browser after logging in manually and keeping it alive is simpler, indeed, as long as the server allows sessions of (essentially) unlimited lifetime. – fviktor Nov 26 '09 at 16:40
- @fviktor - How do I know whether the server allows sessions of unlimited lifetime? Are you referring to the "Keep-alive" header? Can you be a little more specific? – asyncwait Nov 26 '09 at 17:23
- @Aiden Bell -- Can you explain the "sporadic activity"? – asyncwait Nov 26 '09 at 17:24
- I think there is no way to figure it out, since the server can delete the server-side session information even before the cookie expires in your browser. This deletion can be prevented by that sporadic activity. I think Aiden Bell meant periodic dummy requests to the given server even while your crawler is idle. – fviktor Nov 27 '09 at 04:13
- Cookies also have a lifetime on the client side, but if you keep the cookie forever in Python, then that lifetime no longer matters. – fviktor Nov 27 '09 at 04:14
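To illustrate the "sporadic activity" / periodic dummy request idea from the comments above, here is a rough sketch; the ping URL, interval, and cookie file are assumptions:

```python
# Hit a cheap authenticated URL every few minutes so the server-side session
# is not expired for inactivity while the crawler is otherwise idle.
import time
import urllib.error
import urllib.request
import http.cookiejar

cookie_jar = http.cookiejar.LWPCookieJar("cookies.txt")
cookie_jar.load(ignore_discard=True)  # cookie saved earlier, at login time
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(cookie_jar)
)

while True:
    try:
        opener.open("https://example.com/account", timeout=30)  # dummy request
    except urllib.error.URLError as exc:
        print("keep-alive ping failed:", exc)
    time.sleep(10 * 60)  # ping every 10 minutes
```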