I'm using a Python script (mechanize) to log in to a proxy portal. The login succeeds; I can confirm that from the output of read().

However, after the successful login I still couldn't access the sites blocked by the proxy. So I checked the HTTP headers in Firefox (FF) and found Connection: keep-alive, whereas mechanize sends Connection: close. I tried to imitate the Firefox headers exactly using browser.addheaders, but that didn't work either :(

After some digging, I found a couple of suggestions that the server closes the connection because mechanize can't fully emulate a browser: the webpage contains JS, which mechanize doesn't support.

So, is there a way to make the server believe that mechanize is a browser that supports JS, even though it doesn't?

BTW, I don't need JS; I can log in successfully, as I mentioned above. And please don't suggest PhantomJS. I need a Python package to do the job, not a headless browser.

Update:

Firefox headers:

GET xxx HTTP/1.1
Host: xxx
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:43.0) Gecko/20100101 Firefox/43.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Cookie: DSLastAccess=1454082611
Connection: keep-alive


HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8
Set-Cookie: DSEPAgentInstalled=; path=/; expires=Tue, 31-Jan-2006 16:18:32 GMT; secure
Date: Fri, 29 Jan 2016 16:18:32 GMT
x-frame-options: SAMEORIGIN
Connection: Keep-Alive
Keep-Alive: timeout=15
Pragma: no-cache
Cache-Control: no-store
Expires: -1
Transfer-Encoding: chunked

Mechanize addheaders:

# Headers copied from the Firefox request above
# (no line-continuation backslashes are needed inside the brackets)
browser.addheaders = [
    ('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'),
    ('Accept-Language', 'en-US,en;q=0.5'),
    ('Accept-Encoding', 'gzip, deflate'),
    ('Host', 'xxx.net'),
    ('Connection', 'keep-alive'),
    ('Cookie', 'DSLastAccess=1454082611'),
    ('User-agent', 'Mozilla/5.0 (X11; Linux x86_64; rv:43.0) Gecko/20100101 Firefox/43.0'),
]

Mechanize headers:

send: 'CONNECT xxx.net:443 HTTP/1.0\r\n'
send: '\r\n'
send: 'GET xxx.cgi HTTP/1.1\r\nAccept-Language: en-US,en;q=0.5\r\nAccept-Encoding: gzip, deflate\r\nHost: xxx.net\r\nAccept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\r\nUser-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:43.0) Gecko/20100101 Firefox/43.0\r\nConnection: close\r\nCookie: DSLastAccess=1454082611\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Content-Type: text/html; charset=utf-8
header: Set-Cookie: DSEPAgentInstalled=; path=/; expires=Tue, 31-Jan-2006 16:31:03 GMT; secure
header: Date: Fri, 29 Jan 2016 16:31:03 GMT
header: x-frame-options: SAMEORIGIN
header: Connection: close
header: Pragma: no-cache
header: Cache-Control: no-store
header: Expires: -1

Another thing that drives me crazy is that the Connection header mechanize actually sends is close, even though I've set it to keep-alive, as you can see in addheaders.
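
The same override can be reproduced with the standard library alone. The following is a minimal sketch, assuming Python 3 and assuming that mechanize's urllib2 fork behaves like `urllib.request` in this respect (I have not verified that against the mechanize source). The local echo server, the `seen` dictionary, and the function name are illustrative only and are not part of my actual script:

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

seen = {}  # illustrative: stores the Connection header the server received


class EchoHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Record what the client actually sent, then answer with a tiny body
        seen["Connection"] = self.headers.get("Connection")
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def connection_header_seen_by_server():
    # Throwaway local server on a free port
    server = HTTPServer(("127.0.0.1", 0), EchoHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        url = "http://127.0.0.1:%d/" % server.server_address[1]
        # Ask for keep-alive explicitly, as with addheaders above
        request = urllib.request.Request(url, headers={"Connection": "keep-alive"})
        # Empty ProxyHandler so environment proxy settings are ignored
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        with opener.open(request, timeout=5) as response:
            response.read()
    finally:
        server.shutdown()
        server.server_close()
    return seen["Connection"]


if __name__ == "__main__":
    print(connection_header_seen_by_server())  # prints: close
```

The standard handler overwrites the header just before sending, so whatever value is set on the request never reaches the server.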

Mogsdad
  • There is nothing in the HTTP headers about JS. Keep-alive is probably not relevant here. You should probably post the HTTP headers (both request and response) for both the working and the non-working version. Edit out the session cookie or whatever, but check whether it was there. – Sergey Salnikov Jan 27 '16 at 17:41
  • @SergeySalnikov, thanks for the reply. I'm not saying that there is something in HTTP headers about JS. I'm just saying that from the HTTP headers I can tell that the server closes the connection. And that's, probably, because the server can tell that `mechanize` is not a browser. And it can tell because it doesn't see support for JS. So it recognizes `mechanize` as NOT a browser –  Jan 28 '16 at 13:32
  • Do you mean the server closes the connection without any reply? – Sergey Salnikov Jan 28 '16 at 15:27
  • @SergeySalnikov, no, of course it replies. I mean that when I check the server's HTTP headers, they contain `Connection: close` –  Jan 29 '16 at 15:15
  • As far as I know, there's no way for an HTTP server to detect client JavaScript support. The most common way to identify the client is the User-Agent header. It would be great if you posted the request/response headers, as suggested by @SergeySalnikov – Miguel A. Baldi Hörlle Jan 29 '16 at 15:33
  • Miguel A. Baldi Hörlle, kindly check the update –  Jan 29 '16 at 16:35
  • I looked and mechanize [seems to not support](https://github.com/jjlee/mechanize/blob/master/mechanize/_urllib2_fork.py#L1092) persistent connections. But this should be at most a performance problem, otherwise both responses look fine. Where's the problem? – Sergey Salnikov Jan 30 '16 at 18:37
  • @SergeySalnikov The problem is that I'm using it to connect to a proxy portal. So if the connection is not persistent, I can not proceed to open the sites blocked by the proxy. –  Jan 31 '16 at 09:47
  • I have a mechanize script that logs into a site and then proceeds to query it. This works with the `Connection` header set to `close`. – Jared Goguen Feb 01 '16 at 23:15
  • @o_o yeah, me too. I mean, after I log in to this portal, I can `sleep` for some time, then `read`, and I find that I'm still logged in. But the thing is that the blocked websites don't recognize this login (even though they recognize a login from any other browser like FF, Chrome, or IE, even if the login was not made in the same browser). Therefore, I assumed that the problem is due to `Connection:` being set to `close` –  Feb 03 '16 at 08:23
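
For anyone who wants to test whether a persistent connection makes a difference without a headless browser, the standard library's `http.client` keeps one socket open across requests. The following is a minimal sketch, assuming Python 3. The local server, the handler, the paths `/login` and `/page`, and the function name are all hypothetical stand-ins, and it does not cover the HTTPS `CONNECT` tunnel shown in the question. Whether this helps with the portal itself is not established by anything in this thread:

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class KeepAliveHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # allows the server to keep the socket open

    def do_GET(self):
        # Reply with the client's source port so connection reuse is visible
        body = str(self.client_address[1]).encode()
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def ports_seen_for_two_requests():
    # Throwaway local server on a free port
    server = HTTPServer(("127.0.0.1", 0), KeepAliveHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    conn = http.client.HTTPConnection("127.0.0.1", server.server_address[1], timeout=5)
    try:
        ports = []
        for path in ("/login", "/page"):  # hypothetical paths
            conn.request("GET", path)
            # Read the full body so the connection can be reused
            ports.append(conn.getresponse().read().decode())
        return ports
    finally:
        conn.close()  # close the client first so the server thread is released
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    first, second = ports_seen_for_two_requests()
    print(first == second)  # same source port means the socket was reused
```

If the two reported ports match, both requests travelled over the same TCP connection, which is the behaviour a browser's `keep-alive` gives.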

1 Answer

For Linux

First of all, I know some people don't want a suggestion to just switch to another option. However, I believe that if you want to access the page fully after logging in (which currently fails due to the lack of JavaScript support), you should look into using Selenium.

You can grab it with a quick sudo pip install selenium.

Accessing a webpage is as easy as declaring your browser, then telling it to go to the desired webpage. Here is a basic sample that makes your browser go to a webpage; the page I'm using relies heavily on JavaScript:

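from selenium import webdriver

# Start the browser first so that `browser` exists if the load is interrupted
browser = webdriver.Firefox()
try:
    # Selenium expects a full URL, including the scheme
    browser.get('http://mikekus.com')
except KeyboardInterrupt:
    # Ctrl-C while loading: close the browser
    browser.quit()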

This works because Selenium actually opens a browser. However, if you wish to hide the browser so you don't have to see it or have it in your taskbar, I recommend the following setup using pyvirtualdisplay, which hides the browser with visible=0. Note that pyvirtualdisplay is a wrapper for Xvfb, so you need to install that as well. You can get it with sudo apt-get install xvfb:

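from selenium import webdriver
from pyvirtualdisplay import Display

# Start a virtual (invisible) X display, then the browser inside it
display = Display(visible=0, size=(800, 600))
display.start()
browser = webdriver.Firefox()
try:
    # Selenium expects a full URL, including the scheme
    browser.get('http://mikekus.com')
except KeyboardInterrupt:
    # Ctrl-C while loading: close the browser and the virtual display
    browser.quit()
    display.stop()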

I will leave filling in the login forms, etc. to you, as it's quite simple if you read the docs, as everyone should: Navigating With Selenium

Granted, in your situation you are trying to access the proxy and then access another site. With this method you would direct the proxy to the webpage from the proxy's own page, by interacting with the fields on that page. I'm sure that with a bit of time and research you could continue navigating to multiple pages and page elements.

I hope this helps. Good luck.

Colabambino
  • thanks for the detailed answer. Actually, I'm already using `Selenium` and `XVFB`. But for some reason, I can't use them for the posted problem. I just need to solve the problem with `mechanize` thanks again –  Jan 29 '16 at 16:09
  • What is keeping you from using them? And if I may ask, what proxy are you using? Maybe I can do some testing for you – Colabambino Jan 29 '16 at 16:16
  • Upon further investigation: you cannot run JavaScript with mechanize. However, as mentioned in (http://stackoverflow.com/questions/802225/how-do-i-use-mechanize-to-process-javascript), if you set your user_agent to an older browser, you may be able to get past, though without the JavaScript. Good luck. – Colabambino Jan 29 '16 at 16:38
  • 1black1, thanks for the answer, looks promising :) I'll try this older-browser idea and let you know if it works, so you can post it as an answer and get the bounty :) . But this may be on Monday, as I'm leaving the office now –  Jan 29 '16 at 16:51
  • 1black1, adding an old User-Agent didn't work. It looks like mechanize can never have a `keep-alive` connection, as suggested by @Sergey Salnikov. So I'm left with only the `Xvfb` option –  Feb 03 '16 at 10:03