
I'm trying to build a site that crawls various pages hosted on a .onion domain. That means it's not as simple as just calling requests.get("http://XXX.onion"), because .onion addresses are only reachable by connecting through Tor.

I could use a redirector like onion.to, but that requires a click-through, which won't work when I'm crawling.

I don't care about anonymity, I just want the data.

priestc

2 Answers


Requests supports HTTP proxies but, at the time of writing, not SOCKS proxies, which is what Tor provides.

You can either get a patched version of requests (see the question "How to make python Requests work via socks proxy"),

or install Polipo and chain it in front of Tor, so that it "transforms" Tor's SOCKS5 proxy into an HTTP/HTTPS proxy. Here's my config file:

proxyName = "localhost"
proxyAddress = "127.0.0.1"
proxyPort = 8118

allowedClients = 127.0.0.1
allowedPorts = 1-65535

cacheIsShared = false
chunkHighMark = 67108864

socksParentProxy = "localhost:9050"
socksProxyType = socks5


diskCacheRoot = ""
localDocumentRoot = ""

disableLocalInterface = true
disableConfiguration = true
disableVia = true

dnsUseGethostbyname = yes

maxConnectionAge = 5m
maxConnectionRequests = 120

serverMaxSlots = 8
serverSlots = 2

tunnelAllowedPorts = 1-65535

Now, you can just use the proxies with requests:

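import requests

# Polipo listens on port 8118 and forwards to Tor's SOCKS5 port (9050)
proxies = {
    'http': 'http://localhost:8118',
    'https': 'http://localhost:8118'
}

requests.get('http://something.onion/', proxies=proxies)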
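For comparison, Requests 2.10.0 and later can talk to a SOCKS proxy directly once the optional extra is installed (`pip install requests[socks]`), which would make Polipo unnecessary. Below is a minimal sketch, assuming Tor's SOCKS port is the 9050 used in the config above; `tor_proxies` and `fetch_via_tor` are hypothetical helper names, not part of Requests. The `socks5h` scheme asks the proxy to resolve hostnames, which .onion addresses need because ordinary DNS cannot resolve them.

```python
def tor_proxies(host="127.0.0.1", port=9050):
    """Build a Requests-style proxies mapping for a local Tor SOCKS port."""
    # socks5h (with the trailing "h") delegates hostname resolution to the
    # proxy; plain socks5 would try to resolve the .onion name locally.
    url = f"socks5h://{host}:{port}"
    return {"http": url, "https": url}


def fetch_via_tor(url, timeout=60):
    """Fetch a URL through Tor; needs `pip install requests[socks]`."""
    # Imported here so tor_proxies() stays usable without requests installed.
    import requests

    response = requests.get(url, proxies=tor_proxies(), timeout=timeout)
    response.raise_for_status()
    return response.text
```

With Tor running locally, `fetch_via_tor("http://something.onion/")` should return the page body; without Tor listening on that port the call raises a connection error.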
Blender
    I tried the `requesocks` method, but it does not work with .onion domains. It works for regular domains though... The Polipo approach is not ideal, but seems like my only option. – priestc Aug 05 '13 at 05:17
    @priestc: Do you have Tor running? – Blender Aug 05 '13 at 05:18

Why not set up Tor and use wget together with torsocks?

e.g.

# torsocks wget -c --mirror http://kpvz7ki2v5agwt35.onion
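If the crawler itself is written in Python, the same command could be driven from a script. This is only a rough sketch, assuming `torsocks` and `wget` are both on the PATH and Tor is running; `torsocks_wget_command` and `mirror_onion_site` are hypothetical names made up for illustration.

```python
import subprocess


def torsocks_wget_command(url, resume=True):
    """Build the argument list for mirroring a site with wget over Tor."""
    cmd = ["torsocks", "wget"]
    if resume:
        cmd.append("-c")  # continue partially downloaded files
    cmd += ["--mirror", url]  # recursive download with timestamping
    return cmd


def mirror_onion_site(url):
    """Run the mirror command and return wget's exit code."""
    # A list argument (no shell=True) avoids shell quoting problems in URLs.
    return subprocess.run(torsocks_wget_command(url), check=False).returncode
```

Calling `mirror_onion_site("http://kpvz7ki2v5agwt35.onion")` would then write the mirrored files into the current working directory, just as the shell command does.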
innocent-world