
I am using the Requests API with Python 2.7.

I am trying to download certain webpages through proxy servers. I have a list of available proxy servers, but not all of them work as desired: some proxies require authentication, others redirect to advertisement pages, etc. To detect incorrect responses, I have included two checks in my request code. It looks similar to this:

import requests

proxy = '37.228.111.137:80'
url = 'http://www.google.ca/'
response = requests.get(url, proxies={'http': 'http://%s' % proxy})

# Two sanity checks: we ended up at the URL we asked for, and got HTTP 200.
if response.url != url or response.status_code != 200:
    print 'incorrect response'
else:
    print 'response correct'
    print response.text

With some proxy servers, the `requests.get` call succeeds and passes both conditions, yet `response.text` still contains invalid HTML. If I use the same proxy in my Firefox browser and open the same webpage, the browser displays an invalid page, while my Python script reports the response as valid.

Can someone point out what other checks I am missing to weed out incorrect HTML results?

or

How can I verify that the webpage I received is actually the one I intended?

Regards.

Ozair Shafiq
  • try using proxies with `urllib` as done here [Proxy with Urllib2](http://stackoverflow.com/questions/1450132/proxy-with-urllib2) – Vaulstein Aug 17 '15 at 10:54
  • I was using Requests because I thought it would be easier to use and understand. But I just tried using urllib2 as you suggested and the result is the same. The attributes response.url and response.code return the same values as with the Requests API, and the HTML is still invalid. – Ozair Shafiq Aug 17 '15 at 11:10

1 Answer


What is an "invalid webpage" when displayed by your browser? The server can return an HTTP status code of 200, yet the content may be an error message. You recognize it as an error message because you can comprehend it; a browser or your code cannot.

If you have any knowledge about the content of the target page, you could check whether the returned HTML contains that content and accept it on that basis.
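As a minimal sketch of that idea: check the returned HTML for substrings you know must appear on the real page. The marker strings are an assumption you supply from your own knowledge of the target page; they are not something `requests` can derive for you.

```python
def looks_like_target(html, required_markers):
    """Return True only if every expected marker substring appears in the HTML.

    `required_markers` is a hypothetical list of strings known to be on the
    real page (e.g. a title tag or a footer string you have seen before).
    """
    return all(marker in html for marker in required_markers)
```

You would call it after the status check, e.g. `looks_like_target(response.text, ['<title>Google</title>'])`; the fewer and more generic the markers, the more false positives a misbehaving proxy can slip through.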

mhawke
  • By "invalid webpage" I mean a webpage which is not actually what I expected. For example, there is one proxy server which behaves the same with my Python script, but if I use it in my Firefox browser, whatever URL I try to open, it redirects me to a web-based 2D game. Other proxy servers redirect to authentication pages, etc. – Ozair Shafiq Aug 17 '15 at 11:14
  • "If you have any knowledge about the content of the target page, you could check whether the returned HTML contains that content and accept it on that basis." This is a good point, and I am familiar with the target webpage, but would I still have to parse the whole webpage to verify that it's actually what I intended to receive? If I tried to get http://www.google.com, how would you suggest I verify that I am receiving the correct HTML source for this URL? – Ozair Shafiq Aug 17 '15 at 11:18
  • If you look at the HTML returned when using 37.228.111.137:80 as a proxy you'll see that it uses a "meta refresh" tag to redirect the browser. `requests` isn't going to do that, but Firefox will. – mhawke Aug 17 '15 at 11:31
  • Thank you for pointing that out. I didn't know about that. Apparently, the two checks in my Python script are not enough to verify the webpage's correctness. What else can I include to determine whether the webpage is correct or not? – Ozair Shafiq Aug 17 '15 at 11:42
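Following up on the meta-refresh observation in the comments: one additional check is to flag pages that try to redirect via a `<meta http-equiv="refresh">` tag, since `requests` will not follow such redirects but a browser will. The regex below is a deliberate simplification (it assumes `http-equiv` appears inside the `meta` tag in the common attribute order); an HTML parser would be more robust.

```python
import re

# Case-insensitive match for a meta refresh tag, e.g.
# <meta http-equiv="refresh" content="0; url=http://example.com/">
META_REFRESH_RE = re.compile(r'<meta[^>]*http-equiv=["\']?refresh["\']?',
                             re.IGNORECASE)

def has_meta_refresh(html):
    """Return True if the page contains a meta-refresh redirect."""
    return META_REFRESH_RE.search(html) is not None
```

Combined with the URL and status-code checks already in the question, treating any page where `has_meta_refresh(response.text)` is true as suspect would have caught the 37.228.111.137:80 proxy discussed above.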