55

I'm trying to crawl websites using a crawler written in Python. I want to integrate Tor with Python, meaning I want to crawl the sites anonymously through Tor.

I tried the code below, but it doesn't seem to work: when I check my IP address from Python, it is still the same as it was before I started using Tor.

import urllib2
proxy_handler = urllib2.ProxyHandler({"tcp":"http://127.0.0.1:9050"})
opener = urllib2.build_opener(proxy_handler)
urllib2.install_opener(opener)
Obito
  • 5
    Just to let you know, whatismyipaddress' terms of service: You may not use a script, agent, application or otherwise query this website in an automated fashion without prior written permission. – LiraNuna Jul 08 '09 at 06:24
  • 5
    Duplicate of http://stackoverflow.com/questions/711351/using-urllib-with-tor – LiraNuna Jul 08 '09 at 06:30
  • 6
    Given that there were no accepted or particularly useful answers on that other thread, I would vote to keep this thread open as it is still valid in my opinion. – jrista Jul 08 '09 at 06:35
  • 1
    Not quite a dupe, I think - that was a more general question, this is asking for help with a specific code snippet. – Vinay Sajip Jul 08 '09 at 06:36
  • You can check this post; it helped me. http://stackoverflow.com/questions/9887505/changing-tor-identity-inside-python-script – torayeff Jun 09 '12 at 11:26

12 Answers

23

You are trying to connect to a SOCKS port - Tor rejects any non-SOCKS traffic. You can connect through a middleman - Privoxy - using Port 8118.

Example:

proxy_support = urllib2.ProxyHandler({"http" : "127.0.0.1:8118"})
opener = urllib2.build_opener(proxy_support) 
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
print opener.open('http://www.google.com').read()

Also note the value passed to ProxyHandler: the ip:port has no http:// prefix.
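For Python 3, where urllib2 became urllib.request, the same Privoxy approach might look like the sketch below. It assumes Privoxy is listening on 127.0.0.1:8118 and forwarding to Tor; the helper name build_privoxy_opener is made up for illustration. Note that the dictionary keys are URL schemes ("http", "https"), so a key like "tcp" would never match an http:// request.

```python
import urllib.request


def build_privoxy_opener(host="127.0.0.1", port=8118):
    """Return an opener that sends http and https requests via an HTTP proxy.

    Assumption: an HTTP proxy such as Privoxy listens on host:port and
    forwards traffic to Tor.
    """
    proxy_url = "http://%s:%d" % (host, port)
    # Keys are URL schemes; each one gets routed through the same proxy.
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    opener = urllib.request.build_opener(handler)
    opener.addheaders = [("User-agent", "Mozilla/5.0")]
    return opener
```

If Privoxy is set up as described, build_privoxy_opener().open("http://checkip.amazonaws.com/").read() should return the Tor exit address rather than your own.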

Ciro Santilli OurBigBook.com
Dmitri Farkov
  • 1
    8118 is not a Tor port, it's a Privoxy port! Tor listens on 9050 by default. You need to connect to 8118 though, because you are trying to connect via an HTTP proxy, which is what Privoxy provides. – Pankaj Dec 24 '12 at 05:45
  • 7
    This answer is bad, and you should feel bad. Tor control port is 9051, not 9050. 9050 is the socks port that you can use like this http://stackoverflow.com/questions/2317849/how-can-i-use-a-socks-4-5-proxy-with-urllib2 – s3v3n Jan 02 '13 at 15:25
  • I'll edit. I mistook Privoxy's port for Tor's; however, the end result is the same despite the middleman, especially since most Tor installations come bundled with Privoxy. – Dmitri Farkov Jan 08 '13 at 23:32
  • 1
    You're confusing `Tor` with `Tor Bundle`. `Tor Bundle` indeed comes with `Vidalia`, `Privoxy` and `Firefox`, but there is also a standalone `Tor` that on linux can be installed with `apt-get`/`yum`. – s3v3n Jan 23 '13 at 13:18
  • 1
    Ah, my bad. Either way, at no point have I tried to pass off myself as a Tor expert, just suggested a solution that has worked for me. – Dmitri Farkov Jan 24 '13 at 22:05
9
pip install PySocks

Then:

import socket
import socks
import urllib2

ipcheck_url = 'http://checkip.amazonaws.com/'

# Actual IP.
print(urllib2.urlopen(ipcheck_url).read())

# Tor IP.
socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, '127.0.0.1', 9050)
socket.socket = socks.socksocket
print(urllib2.urlopen(ipcheck_url).read())

Using just urllib2.ProxyHandler as in https://stackoverflow.com/a/2015649/895245 fails with:

Tor is not an HTTP Proxy

Mentioned at: How can I use a SOCKS 4/5 proxy with urllib2?

Tested on Ubuntu 15.10, Tor 0.2.6.10, Python 2.7.10.

Community
Ciro Santilli OurBigBook.com
  • 2
    With 9050 it's not working for me in Python 3. I'm getting the following error: `urllib.error.URLError: `. With **9150** instead of 9050 works, though. – J0ANMM Oct 20 '16 at 12:34
  • @JoanMM thanks for report. Please give your exact OS, python and tor versions. Does it work on Python 2 for you? – Ciro Santilli OurBigBook.com Oct 20 '16 at 13:06
  • Mac OS X Version 10.9.5 / Python 3.5.2 / Tor Browser for Mac Version 6.0.5 - OS X (10.6+). I didn't check in Python 2, I'm only using Python 3. – J0ANMM Oct 20 '16 at 13:12
3

The following code works for me on Python 3.4.

(You need to keep Tor Browser open while using this code.)

This script connects to Tor through SOCKS5, gets the IP from checkip.dyn.com, changes identity, and resends the request to get the new IP (it loops 10 times).

You need to install the libraries it imports (PySocks, stem, requests, and beautifulsoup4) to get this working. (Enjoy and don't abuse.)

import socket
import time

import requests
import socks
from bs4 import BeautifulSoup
from stem import Signal
from stem.control import Controller

err = 0
counter = 0
url = "http://checkip.dyn.com"

# 9151 is the Tor Browser control port; 9150 is its SOCKS port.
with Controller.from_port(port=9151) as controller:
    try:
        controller.authenticate()
        # Route every newly created socket through the Tor SOCKS5 proxy.
        socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, "127.0.0.1", 9150)
        socket.socket = socks.socksocket
        while counter < 10:
            r = requests.get(url)
            # Turn 4xx/5xx responses into exceptions so they are counted below.
            r.raise_for_status()
            soup = BeautifulSoup(r.content, "html.parser")
            print(soup.find("body").text)
            counter = counter + 1
            # Ask for a new identity, then wait until the next one is available.
            controller.signal(Signal.NEWNYM)
            time.sleep(controller.get_newnym_wait())
    except requests.RequestException:
        # Covers connection failures and timeouts as well as HTTP errors.
        print("Could not reach URL")
        err = err + 1
print("Used " + str(counter) + " IPs and got " + str(err) + " errors")
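One caveat about the final count: it counts requests, not distinct addresses, and as far as I know NEWNYM only asks Tor for fresh circuits, so two consecutive requests can still leave through the same exit. A small sketch for counting the distinct IPs actually seen is below. It uses only the standard library; the helper names and the sample page body are assumptions for illustration, not the service's documented format.

```python
import re

# Loose IPv4 matcher: four dot-separated groups of 1-3 digits.
IPV4_PATTERN = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def extract_ip(page_text):
    """Return the first IPv4-looking string in the page, or None."""
    match = IPV4_PATTERN.search(page_text)
    return match.group(0) if match else None


def count_distinct_ips(pages):
    """Count how many different IPs appear across a list of page bodies."""
    seen = {extract_ip(page) for page in pages}
    seen.discard(None)  # pages without an address do not count
    return len(seen)
```

Collecting each r.text in a list inside the loop and passing it to count_distinct_ips at the end would give a more honest "Used N IPs" figure.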
Amine
2

Using privoxy as http-proxy in front of tor works for me - here's a crawler-template:


import urllib2
import httplib

from BeautifulSoup import BeautifulSoup
from time import sleep

class Scraper(object):
    def __init__(self, options, args):
        if options.proxy is None:
            options.proxy = "http://localhost:8118/"
        self._open = self._get_opener(options.proxy)

    def _get_opener(self, proxy):
        proxy_handler = urllib2.ProxyHandler({'http': proxy})
        opener = urllib2.build_opener(proxy_handler)
        return opener.open

    def get_soup(self, url):
        soup = None
        while soup is None:
            try:
                request = urllib2.Request(url)
                request.add_header('User-Agent', 'foo bar useragent')
                soup = BeautifulSoup(self._open(request))
            except (httplib.IncompleteRead, httplib.BadStatusLine,
                    urllib2.HTTPError, ValueError, urllib2.URLError), err:
                sleep(1)
        return soup

class PageType(Scraper):
    _URL_TEMPL = "http://foobar.com/baz/%s"

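    def items_from_page(self, url):
        nextpage = None
        soup = self.get_soup(url)

        items = []
        for item in soup.findAll("foo"):
            items.append(item["bar"])
            nextpage = item["href"]

        return nextpage, items

    def get_items(self):
        nextpage, items = self.items_from_page(self._URL_TEMPL % "start.html")
        while nextpage is not None:
            nextpage, newitems = self.items_from_page(self._URL_TEMPL % nextpage)
            items.extend(newitems)
        return items

# Assumption: Scraper expects optparse-style options, so build them here.
# Passing --proxy overrides the default Privoxy address.
from optparse import OptionParser
parser = OptionParser()
parser.add_option("--proxy", dest="proxy", default=None)
options, args = parser.parse_args()

pt = PageType(options, args)
print pt.get_items()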
2

Update - The latest (upwards of v2.10.0) requests library supports socks proxies with an additional requirement of requests[socks].

Installation -

pip install requests requests[socks]

Basic usage -

import requests
session = requests.session()
# Tor uses the 9050 port as the default socks port
session.proxies = {'http':  'socks5://127.0.0.1:9050',
                   'https': 'socks5://127.0.0.1:9050'}

# Make a request through the Tor connection
# IP visible through Tor
print session.get("http://httpbin.org/ip").text
# Above should print an IP different than your public IP

# Following prints your normal public IP
print requests.get("http://httpbin.org/ip").text
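One detail worth checking with this approach: according to the requests documentation, the socks5:// scheme resolves hostnames on the client, while socks5h:// hands DNS resolution to the proxy. If you don't want your DNS lookups to bypass Tor, the second form is probably what you want. A minimal sketch of a helper that builds the proxies dictionary either way (the function name tor_proxies is hypothetical):

```python
def tor_proxies(host="127.0.0.1", port=9050, remote_dns=True):
    """Build a requests-style proxies dict pointing at a Tor SOCKS port.

    remote_dns=True picks the socks5h scheme, so hostnames are resolved
    by the proxy instead of locally.
    """
    scheme = "socks5h" if remote_dns else "socks5"
    proxy_url = "%s://%s:%d" % (scheme, host, port)
    return {"http": proxy_url, "https": proxy_url}
```

Usage would be along the lines of session.proxies = tor_proxies(), or tor_proxies(port=9150) when going through Tor Browser.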

Old answer - Even though this is an old post, answering because no one seems to have mentioned the requesocks library.

It is basically a port of the requests library. Please note that the library is an old fork (last updated 2013-03-25) and may not have the same functionalities as the latest requests library.

Installation -

pip install requesocks

Basic usage -

# Assuming that Tor is up & running
import requesocks
session = requesocks.session()
# Tor uses the 9050 port as the default socks port
session.proxies = {'http':  'socks5://127.0.0.1:9050',
                   'https': 'socks5://127.0.0.1:9050'}
# Make a request through the Tor connection
# IP visible through Tor
print session.get("http://httpbin.org/ip").text
# Above should print an IP different than your public IP
# Following prints your normal public IP
import requests
print requests.get("http://httpbin.org/ip").text
shad0w_wa1k3r
  • 1
    You can also just use the latest version of requests, which has a proxies= parameter where you can pass `socks5://127.0.0.1:9050` – james-see Nov 01 '16 at 17:52
2

The following solution works for me in Python 3. Adapted from CiroSantilli's answer:

With urllib (name of urllib2 in Python 3):

import socks
import socket
from urllib.request import urlopen

url = 'http://icanhazip.com/'

socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, '127.0.0.1', 9150)
socket.socket = socks.socksocket

response = urlopen(url)
print(response.read())

With requests:

import socks
import socket
import requests

url = 'http://icanhazip.com/'

socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, '127.0.0.1', 9150)
socket.socket = socks.socksocket

response = requests.get(url)
print(response.text)

With Selenium + PhantomJS:

from selenium import webdriver

url = 'http://icanhazip.com/'

service_args = [ '--proxy=localhost:9150', '--proxy-type=socks5', ]
phantomjs_path = '/your/path/to/phantomjs'

driver = webdriver.PhantomJS(
    executable_path=phantomjs_path, 
    service_args=service_args)

driver.get(url)
print(driver.page_source)
driver.close()

Note: If you are planning to use Tor often, consider making a donation to support their awesome work!

Community
J0ANMM
2

Here is code for downloading files through a Tor proxy in Python (update the URL):

import urllib2

url = "http://www.disneypicture.net/data/media/17/Donald_Duck2.gif"

proxy = urllib2.ProxyHandler({'http': '127.0.0.1:8118'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)

file_name = url.split('/')[-1]
u = urllib2.urlopen(url)
f = open(file_name, 'wb')
meta = u.info()
file_size = int(meta.getheaders("Content-Length")[0])
print "Downloading: %s Bytes: %s" % (file_name, file_size)

file_size_dl = 0
block_sz = 8192
while True:
    buffer = u.read(block_sz)
    if not buffer:
        break

    file_size_dl += len(buffer)
    f.write(buffer)
    status = r"%10d  [%3.2f%%]" % (file_size_dl, file_size_dl * 100. / file_size)
    status = status + chr(8)*(len(status)+1)
    print status,

f.close()
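Note that the loop above assumes the server sends a Content-Length header; without it, getheaders("Content-Length")[0] raises an IndexError. A Python 3 sketch of the same copy loop that treats the total size as optional is below. The function name and the report callback are made up for illustration, and it works on any object with a read() method.

```python
def copy_with_progress(response, out_file, total_size=None,
                       block_size=8192, report=print):
    """Copy response to out_file in blocks; return the number of bytes copied.

    total_size is optional: when it is missing, only the running byte
    count is reported instead of a percentage.
    """
    downloaded = 0
    while True:
        chunk = response.read(block_size)
        if not chunk:
            break
        out_file.write(chunk)
        downloaded += len(chunk)
        if total_size:
            report("%10d  [%3.2f%%]" % (downloaded, downloaded * 100.0 / total_size))
        else:
            report("%10d bytes" % downloaded)
    return downloaded
```

With an opener already installed, something like copy_with_progress(u, f, total_size) would replace the while loop, where total_size comes from the Content-Length header if the server provided one.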
carloona
1

Perhaps you're having some network connectivity issues? The above script worked for me (I substituted a different URL, http://stackoverflow.com/) and I get the page as expected:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd" >
 <html> <head>

<title>Stack Overflow</title>        
<link rel="stylesheet" href="/content/all.css?v=3856">

(etc.)

Vinay Sajip
0

Tor is a socks proxy. Connecting to it directly with the example you cite fails with "urlopen error Tunnel connection failed: 501 Tor is not an HTTP Proxy". As others have mentioned you can get around this with Privoxy.

Alternatively you can also use PycURL or SocksiPy. For examples of using both with tor see...

https://stem.torproject.org/tutorials/to_russia_with_love.html
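If you are not sure whether a local port speaks SOCKS or HTTP, one rough way to find out is to send it a SOCKS5 greeting and look at the reply. The sketch below uses only the standard library and follows the greeting format from RFC 1928; the function name is made up, and it assumes the server accepts unauthenticated connections, which I believe is Tor's default.

```python
import socket


def speaks_socks5(host="127.0.0.1", port=9050, timeout=3.0):
    """Return True if the port answers a SOCKS5 greeting, False otherwise."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            # Greeting: version 5, one auth method offered, method 0 (no auth).
            conn.sendall(b"\x05\x01\x00")
            reply = conn.recv(2)
    except OSError:
        # Connection refused, timed out, or reset: treat as "not SOCKS5".
        return False
    # A SOCKS5 server accepting "no auth" replies with version 5, method 0.
    return reply == b"\x05\x00"
```

An HTTP proxy such as Privoxy would answer with something else (or nothing), so the function returns False for it.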

Damian
0

You can use torify.

Run your program with:

$ torify python your_program.py
Mohamed Emad
0

Thought I would just share a solution that worked for me (python3, windows10):

Step 1: Enable your Tor ControlPort at 9151.

Tor service runs at default port 9150 and ControlPort on 9151. You should be able to see local address 127.0.0.1:9150 and 127.0.0.1:9151 when you run netstat -an.

[go to windows terminal]
cd ...\Tor Browser\Browser\TorBrowser\Tor
tor --service remove
tor --service install -options ControlPort 9151
netstat -an 

Step 2: Python script as follows.

# library to launch and kill Tor process
import os
import subprocess

# library for Tor connection
import socket
import socks
import http.client
import time
import requests
from stem import Signal
from stem.control import Controller

# library for scraping
import csv
import urllib
from bs4 import BeautifulSoup

def launchTor():
    # start Tor (wait 30 sec for Tor to load)
    sproc = subprocess.Popen(r'.../Tor Browser/Browser/firefox.exe')
    time.sleep(30)
    return sproc

def killTor(sproc):
    sproc.kill()

def connectTor():
    socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, "127.0.0.1", 9150, True)
    socket.socket = socks.socksocket
    print("Connected to Tor")

def set_new_ip():
    """Change IP using TOR"""
    # disable the SOCKS proxy, then enable it again below
    socks.setdefaultproxy()
    with Controller.from_port(port=9151) as controller:
        controller.authenticate()
        socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, "127.0.0.1", 9150, True)
        socket.socket = socks.socksocket
        controller.signal(Signal.NEWNYM)

def checkIP():
    conn = http.client.HTTPConnection("icanhazip.com")
    conn.request("GET", "/")
    time.sleep(3)
    response = conn.getresponse()
    print('current ip address :', response.read())

# Launch Tor and connect to Tor network
sproc = launchTor()
connectTor()

# list of url to scrape
url_list = []  # fill in all the URLs you want to scrape

for url in url_list:
    # set new ip and check ip before scraping for each new url
    set_new_ip()
    # allow some time for IP address to refresh
    time.sleep(5)
    checkIP()

    '''
    [insert your scraping code here: bs4, urllib, your usual thingy]
    '''

# remember to kill process 
killTor(sproc)

This script above will renew IP address for every URL that you want to scrape. Just make sure to sleep it long enough for IP to change. Last tested yesterday. Hope this helps!
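Instead of guessing how long to sleep, you could ask Tor how long it wants you to wait, using the get_newnym_wait() call that an earlier answer also relies on. A minimal sketch follows. The function name is hypothetical, and it takes the controller (and the sleep function) as parameters so it does not import stem itself.

```python
import time


def request_new_identity(controller, newnym_signal="NEWNYM", sleep=time.sleep):
    """Wait until Tor will accept NEWNYM, then send it.

    controller is expected to behave like an authenticated stem Controller
    (get_newnym_wait() and signal() are the only methods used).
    Returns the number of seconds waited before sending the signal.
    """
    wait = controller.get_newnym_wait()
    if wait > 0:
        sleep(wait)
    controller.signal(newnym_signal)
    return wait
```

Inside set_new_ip this would be called as request_new_identity(controller, Signal.NEWNYM) in place of the bare controller.signal(...) call. A new circuit can still end up on the same exit node, so checking the IP afterwards remains a good idea.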

KittyBot
0

To expand on the earlier suggestion to use torify: the following works with Tor Browser and doesn't need Privoxy:

pip install PySocks
pip install pyTorify

(install Tor browser and start it up)

Command line usage:

python -mtorify -p 127.0.0.1:9150 your_script.py

Or built into a script:

import torify
torify.set_tor_proxy("127.0.0.1", 9150)
torify.disable_tor_check()
torify.use_tor_proxy()

# use urllib as normal
import urllib.request
req = urllib.request.Request("http://....")
req.add_header("Referer", "http://...") # etc
res = urllib.request.urlopen(req)
html = res.read().decode("utf-8")

Note: Tor Browser uses port 9150, not 9050.
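Since the port differs between the standalone Tor daemon (9050) and Tor Browser (9150), a script can probe for whichever one is listening rather than hard-coding it. This is only a sketch using the standard library; the function name is made up, and an open port does not prove that Tor is what is listening on it.

```python
import socket


def first_open_port(candidates=(9050, 9150), host="127.0.0.1", timeout=1.0):
    """Return the first port in candidates that accepts a TCP connection."""
    for port in candidates:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return port
        except OSError:
            # Nothing listening here (or it timed out); try the next one.
            continue
    return None
```

The result could then be passed on, for example torify.set_tor_proxy("127.0.0.1", first_open_port()), after checking that it is not None.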

Steve Lockwood