Easiest way to crawl a site with an .onion domain?

Question

I'm trying to build a site that crawls various pages hosted on an .onion domain. That means it's not as simple as just calling requests.get("http://XXX.onion"), because .onion sites are only reachable by connecting through Tor.

I could use a redirector like onion.to, but that requires a click through, which won't work when I'm crawling.

I don't care about anonymity, I just want the data.

I would like to note that your username is kind of a painful combination with the question. — orlp, Aug 05 '13 at 04:23

score 1 · Answer 1 · edited May 23 '17 at 12:30

Requests supports HTTP proxies but, as of this writing, not SOCKS proxies, which is what Tor provides.

You can either get a patched version of requests: How to make python Requests work via socks proxy

Or install Polipo and use it as a second proxy that "transforms" Tor's SOCKS5 proxy into an HTTP/HTTPS proxy. Here's my config file:

proxyName = "localhost"
proxyAddress = "127.0.0.1"
proxyPort = 8118

allowedClients = 127.0.0.1
allowedPorts = 1-65535

cacheIsShared = false
chunkHighMark = 67108864

socksParentProxy = "localhost:9050"
socksProxyType = socks5


diskCacheRoot = ""
localDocumentRoot = ""

disableLocalInterface = true
disableConfiguration = true
disableVia = true

dnsUseGethostbyname = yes

maxConnectionAge = 5m
maxConnectionRequests = 120

serverMaxSlots = 8
serverSlots = 2

tunnelAllowedPorts = 1-65535

Now, you can just use the proxies with requests:

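import requests

# Route both schemes through Polipo's local HTTP proxy (port 8118 in the config above).
proxies = {
    'http': 'http://localhost:8118',
    'https': 'http://localhost:8118'
}

response = requests.get('http://something.onion/', proxies=proxies)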
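
For later readers: Requests releases from 2.10.0 onward can speak SOCKS directly when installed with the optional extra (`pip install requests[socks]`), so the Polipo hop may not be needed. The following is a minimal sketch, assuming Tor is listening on its default SOCKS port 9050; `tor_proxies` and `fetch` are hypothetical helper names. The `socks5h` scheme (rather than `socks5`) asks the proxy to resolve hostnames, which matters for .onion addresses because they cannot be resolved by ordinary DNS.

```python
import requests

# Assumption: Tor's SOCKS listener is on the default 127.0.0.1:9050.
# "socks5h" tells Requests to let the proxy resolve hostnames.
TOR_SOCKS = "socks5h://127.0.0.1:9050"


def tor_proxies(socks_url=TOR_SOCKS):
    """Return a proxies mapping that sends HTTP and HTTPS traffic through Tor."""
    return {"http": socks_url, "https": socks_url}


def fetch(url, timeout=60):
    """Fetch one page through Tor and return its body as text."""
    response = requests.get(url, proxies=tor_proxies(), timeout=timeout)
    response.raise_for_status()  # surface HTTP errors instead of returning error pages
    return response.text
```

Calling `fetch('http://something.onion/')` would then work much like the Polipo version above, without the extra proxy in between.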

I tried the `requesocks` method, but it does not work with .onion domains. It works for regular domains though... The Polipo approach is not ideal, but seems like my only option. — priestc, Aug 05 '13 at 05:17

score 0 · Answer 2 · answered Aug 28 '13 at 01:39

Why not set up Tor and use wget together with torsocks?

e.g.

# torsocks wget -c --mirror http://kpvz7ki2v5agwt35.onion
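
If the crawl has to be started from Python code, the same command can be launched with `subprocess`. This is only a sketch, assuming `torsocks` and `wget` are installed and on the PATH; `build_mirror_command` and `mirror` are hypothetical helper names, not part of either tool.

```python
import subprocess


def build_mirror_command(url, resume=True):
    """Return the torsocks + wget argument list for mirroring `url`."""
    command = ["torsocks", "wget"]
    if resume:
        command.append("-c")  # continue partially downloaded files
    command += ["--mirror", url]
    return command


def mirror(url):
    """Run the mirror; raises CalledProcessError if wget exits non-zero."""
    return subprocess.run(build_mirror_command(url), check=True)
```

Passing the arguments as a list avoids shell quoting problems when the URL comes from untrusted input.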

innocent-world
