
I am routing my crawler's requests through Tor. The crawler is multithreaded but puts very little load on the server: each thread sleeps for a normally distributed random time with a mean of 20 seconds (approximately 3 requests a minute). I need to get the first Google search result for about 20,000 queries. The crawler is written in Python using urllib2 (SOCKS proxy) and mechanize (HTTP proxy).
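To make the throttling concrete, here is a minimal sketch of such a per-thread delay. Only the 20-second mean comes from the description above; the standard deviation, the lower bound, and the name `next_delay` are assumptions made for illustration.

```python
import random

MEAN_DELAY_S = 20.0  # mean stated in the question
STD_DELAY_S = 5.0    # assumed spread; the question does not give one
MIN_DELAY_S = 1.0    # assumed floor so a draw can never be negative or near zero


def next_delay(rng=random):
    """Draw one sleep time in seconds from a normal distribution, clamped from below."""
    return max(MIN_DELAY_S, rng.gauss(MEAN_DELAY_S, STD_DELAY_S))
```

Each worker thread would then call `time.sleep(next_delay())` between requests. Without the clamp, a normal draw can occasionally come out negative, which `time.sleep` rejects.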

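    # Snippet that builds the urllib2 opener (taken from inside a method,
    # hence `self` and `return`).
    # Imports assumed: SocksiPyHandler is taken to come from the sockshandler
    # module that ships with SocksiPy/PySocks.
    import cookielib
    import urllib2
    import socks
    from sockshandler import SocksiPyHandler

    host = socks_hostname
    port = socks_port
    socks_username = username
    socks_password = password
    cj = cookielib.CookieJar()  # fresh, empty cookie jar for this opener
    br = urllib2.build_opener(
        SocksiPyHandler(socks.PROXY_TYPE_SOCKS5, host, port,
                        username=socks_username,
                        password=socks_password),
        urllib2.HTTPCookieProcessor(cj))

    # Use a randomly generated User-Agent string.
    br.addheaders = [('User-Agent', self.get_user_agent())]
    return br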

I just discovered that the Tor network isn't hiding my IP as far as Google is concerned. I wrote a small test script to check the IP address reported by Google and by http://whatismyip.net. whatismyip.net sees an IP based in Canada, but Google shows my real IP, which confuses me. I have made sure that I don't have any cookies that can be tracked.
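A check of this kind can be sketched roughly as follows. This is not the actual test script; the Google query URL, the regular expression, and the name `find_leaks` are hypothetical, and `fetch` stands for any callable that takes a URL and returns the page body as text (for example a thin wrapper around the proxied opener).

```python
import re

# Hypothetical endpoints; any page that echoes the caller's address would do.
IP_ECHO_URLS = [
    "http://whatismyip.net",
    "https://www.google.com/search?q=what+is+my+ip",
]

# Loose IPv4 pattern; it does not validate that each octet is <= 255.
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def find_leaks(fetch, real_ip, urls=IP_ECHO_URLS):
    """Return {url: sorted IPv4 strings on the page} for pages that mention real_ip."""
    leaks = {}
    for url in urls:
        seen = set(IPV4_RE.findall(fetch(url)))
        if real_ip in seen:
            leaks[url] = sorted(seen)
    return leaks
```

An empty result means none of the pages echoed the real address. Any URL that does appear in the result saw the request arrive from outside the proxy.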

What is even more puzzling is that when I use Tor in Firefox, Google also shows a random IP based in Canada. So my real IP is exposed only when I send automated requests. Can someone help me figure out what is causing this leak?

I understand crawling is a sensitive topic, but my crawl rate is actually slower than a human's!

harshal.c
  • Can you somehow share your code? It's hard to blindly say anything. Or at least share your tests. – khajvah Dec 30 '15 at 19:05
  • @khajvah I edited the question, added a snippet of the code that I use to create urllib2 opener – harshal.c Dec 30 '15 at 19:07
  • How do you route the traffic through Tor routers? – khajvah Dec 30 '15 at 19:12
  • @khajvah The SocksiPyHandler sets the SOCKS proxy and authentication in the urllib2 opener. In the case of the HTTP proxy I just set the proxy with mechanize's usual methods, but get the same results. As for routing through Tor servers, I am sure it's happening, because otherwise http://whatismyip.net wouldn't show me a different IP. – harshal.c Dec 30 '15 at 19:15
  • I haven't worked with Tor but one dumb guess: Is there a possibility that `Https` traffic isn't being routed? – khajvah Dec 30 '15 at 19:19
  • @khajvah It was one of the first things I checked. All my traffic is being routed through Tor. I am inclined to believe that there is some leak during DNS resolution with Tor, as my school uses Google DNS, but I don't know how to check that. – harshal.c Dec 30 '15 at 19:21
  • A DNS leak wouldn't cause Google to show you the leaked IP. Google and others would, I think, just show whoever connected to them directly. [Others](https://stackoverflow.com/questions/23220494/tor-doesnt-work-with-urllib2?rq=1) had a similar problem, but it seems it was an HTTPS issue. – khajvah Dec 30 '15 at 19:40
  • https://stem.torproject.org/faq.html could be an alternative. – ρss Dec 30 '15 at 20:07
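On the DNS question raised in the comments: in SOCKS5 (RFC 1928), whether a name lookup bypasses the proxy depends on the address type the client puts in its CONNECT request. The sketch below builds that request by hand to show the difference. The function name and the `remote_dns` flag are made up for illustration, and nothing here shows which form the crawler's handler actually sends.

```python
import socket
import struct


def socks5_connect_request(host, port, remote_dns=True):
    """Build the SOCKS5 CONNECT request bytes described in RFC 1928, section 4."""
    header = b"\x05\x01\x00"  # VER=5, CMD=CONNECT, RSV=0
    if remote_dns:
        name = host.encode("idna")
        if len(name) > 255:
            raise ValueError("hostname too long for SOCKS5")
        # ATYP=3: send the name itself, so the proxy performs the lookup.
        addr = b"\x03" + bytes([len(name)]) + name
    else:
        # ATYP=1: resolve locally first; this DNS query does not go through the proxy.
        addr = b"\x01" + socket.inet_aton(socket.gethostbyname(host))
    return header + addr + struct.pack(">H", port)
```

If the bytes on the wire start with `05 01 00 03`, the hostname is being handed to the proxy. A fourth byte of `01` means the client resolved the name itself before connecting.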
