
I'm writing a web scraping program in Python using mechanize. The problem I'm having is that the website I'm scraping limits the amount of time you can spend on it. When I was doing everything by hand, I would use a SOCKS proxy as a workaround.

What I tried to do is go to the network preferences (13" MacBook Pro Retina, Mavericks) and change to the proxy. However, the program didn't respond to that change; it kept running without the proxy.

Then I added .set_proxies(), so now the code that opens the website looks something like this:

b = mechanize.Browser()                             # open browser
b.set_proxies({"http": "96.8.113.76:8080"})         # set proxy
DBJ = b.open(URL)                                   # open url

When I ran the program, I got this error:

Traceback (most recent call last):
 File "GM1.py", line 74, in <module>
   DBJ=b.open(URL)                  
 File "build/bdist.macosx-10.9-intel/egg/mechanize/_mechanize.py", line 203, in open
 File "build/bdist.macosx-10.9-intel/egg/mechanize/_mechanize.py", line 230, in _mech_open
 File "build/bdist.macosx-10.9-intel/egg/mechanize/_opener.py", line 193, in open
 File "build/bdist.macosx-10.9-intel/egg/mechanize/_urllib2_fork.py", line 344, in _open
 File "build/bdist.macosx-10.9-intel/egg/mechanize/_urllib2_fork.py", line 332, in _call_chain
 File "build/bdist.macosx-10.9-intel/egg/mechanize/_urllib2_fork.py", line 1142, in http_open
 File "build/bdist.macosx-10.9-intel/egg/mechanize/_urllib2_fork.py", line 1118, in do_open
urllib2.URLError: <urlopen error [Errno 54] Connection reset by peer>

I'm assuming that the proxy was changed and that this error is in response to that proxy.

Maybe I am misusing .set_proxies().

I'm not sure if the proxy itself is the issue or if the connection is just really slow.

Should I even be using SOCKS proxies for this type of thing, or is there a better alternative for what I'm trying to do?

Any information would be extremely helpful. Thanks in advance.


1 Answer


A SOCKS proxy is not the same as an HTTP proxy: the protocol between client and proxy is different. The line:

b.set_proxies({"http":"96.8.113.76:8080"})

tells mechanize to use the HTTP proxy at 96.8.113.76:8080 for requests whose URL has the http scheme, e.g. a request for http://httpbin.org/get will be sent via the proxy at 96.8.113.76:8080. Mechanize expects this to be an HTTP proxy server and speaks the corresponding protocol. It seems that your SOCKS proxy is closing the connection because it is not receiving a valid SOCKS request (it is actually receiving an HTTP proxy request).
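If you want to confirm what the proxy actually speaks before wiring it into mechanize, a raw-socket probe along these lines can help (a minimal sketch; looks_like_socks5 is just an illustrative helper name, and the host/port are the ones from the question):

import socket

def looks_like_socks5(host, port, timeout=5):
    # Send the SOCKS5 greeting: version 5, one auth method offered ("no auth").
    # A real SOCKS5 server answers with b'\x05' plus its chosen method; an HTTP
    # proxy will typically send an HTTP error line or drop the connection,
    # which shows up here as a socket error or a non-0x05 first byte.
    try:
        sock = socket.create_connection((host, port), timeout)
        try:
            sock.sendall(b'\x05\x01\x00')
            reply = sock.recv(2)
        finally:
            sock.close()
    except socket.error:
        return False
    return reply[:1] == b'\x05'

print(looks_like_socks5('96.8.113.76', 8080))  # True only for a SOCKS5 server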

I don't think that mechanize has built-in support for SOCKS, so you may have to resort to some dirty tricks such as those in this answer. For that you will need to install the PySocks package:
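pip install PySocks

This might work for you: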

import socks
import socket
from mechanize import Browser

SOCKS_PROXY_HOST = '96.8.113.76'
SOCKS_PROXY_PORT = 8080

def create_connection(address, timeout=None, source_address=None):
    # Drop-in replacement for socket.create_connection; timeout and
    # source_address are accepted for signature compatibility but ignored.
    sock = socks.socksocket()
    sock.connect(address)
    return sock

# add username and password arguments if proxy authentication required.
socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, SOCKS_PROXY_HOST, SOCKS_PROXY_PORT)

# patch the socket module
socket.socket = socks.socksocket
socket.create_connection = create_connection

br = Browser()
response = br.open('http://httpbin.org/get')

>>> print response.read()
{
  "args": {}, 
  "headers": {
    "Accept-Encoding": "identity", 
    "Connection": "close", 
    "Host": "httpbin.org", 
    "User-Agent": "Python-urllib/2.7", 
    "X-Request-Id": "e728cd40-002c-4f96-a26a-78ce4d651fda"
  }, 
  "origin": "192.161.1.100", 
  "url": "http://httpbin.org/get"
}
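Note that this approach monkey-patches the socket module for the whole process, so every connection the program makes (not just mechanize's) will be routed through the SOCKS proxy. That is usually fine for a small scraper, but worth keeping in mind if the program talks to other services too.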
– mhawke
  • I'm getting: mechanize._response.httperror_seek_wrapper: HTTP Error 403: request disallowed by robots.txt and after adding br.set_handle_robots(False): Operation timed out – Makalele Apr 11 '17 at 08:41
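For anyone who hits the robots.txt error mentioned in the comment above: mechanize fetches and obeys robots.txt by default, and many sites also block the default Python-urllib User-Agent, so disabling the robots handler and sending a browser-like User-Agent may help (a minimal sketch; the User-Agent string is just an example, and the later "Operation timed out" is more likely the proxy itself being dead or overloaded):

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)  # do not fetch or obey robots.txt
# Many sites reject the default Python-urllib User-Agent; send a browser-like one.
br.addheaders = [('User-Agent',
                  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9) AppleWebKit/537.36')]
response = br.open('http://httpbin.org/get')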