I am routing my crawler's requests through Tor. The crawler is multithreaded but puts very little load on the server, since each thread sleeps between requests for a normally distributed random time with a mean of 20 seconds (approx. 3 requests a minute). I need to get the first Google search result for some 20,000-odd queries. The crawler is written in Python using urllib2 (SOCKS proxy) and mechanize (HTTP proxy).
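# Snippet of code initializing the urllib2 opener
import cookielib
import urllib2

import socks  # SocksiPy
# SocksiPyHandler (the urllib2 handler that wraps SocksiPy) is defined or imported elsewhere in the crawler.

host = socks_hostname
port = socks_port
socks_username = username
socks_password = password
cj = cookielib.CookieJar()
# Route every request through the SOCKS5 proxy and keep cookies in cj.
br = urllib2.build_opener(
    SocksiPyHandler(socks.PROXY_TYPE_SOCKS5, host, port,
                    username=socks_username,
                    password=socks_password),
    urllib2.HTTPCookieProcessor(cj))
# Get randomly generated User-Agent string.
br.addheaders = [('User-Agent', self.get_user_agent())]
return br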
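The per-thread throttling mentioned at the top can be sketched roughly as follows. This is a minimal sketch and not the crawler's actual code: `next_delay` and `polite_sleep` are hypothetical names, and the standard deviation of 5 seconds and the 1-second floor are assumed values, since only the 20-second mean is stated.

```python
import random
import time


def next_delay(mean=20.0, stddev=5.0, minimum=1.0):
    """Return a normally distributed delay in seconds, never below `minimum`."""
    # random.gauss can return negative values, so clamp the result.
    # stddev and minimum are assumed values; only the mean comes from the post.
    return max(minimum, random.gauss(mean, stddev))


def polite_sleep(mean=20.0, stddev=5.0, minimum=1.0, sleep=time.sleep):
    """Sleep for one randomly drawn delay and return the delay that was used."""
    delay = next_delay(mean, stddev, minimum)
    sleep(delay)
    return delay
```

Each worker thread would call `polite_sleep()` once before every request, which averages out to about 3 requests a minute for that thread.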
I just discovered that the Tor network isn't hiding my IP address as far as Google is concerned. I wrote a small test script to check the IP address reported by Google and by http://whatismyip.net. whatismyip.net sees an IP address based in Canada, but Google shows my real IP address, which confuses me. I have made sure that I don't have any cookies that could be used to track me.
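A check of this kind can be sketched as below. It is a minimal sketch, not the exact test script: `reported_ip` and `compare_reported_ips` are hypothetical helper names, the opener is assumed to be the one built by the snippet above, and the helper simply assumes that the first IPv4-looking string in the page body is the address the site saw.

```python
import re

# Matches the first thing that looks like a dotted-quad IPv4 address.
IPV4_PATTERN = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def reported_ip(opener, url):
    """Fetch `url` through `opener`; return the first IPv4 address in the body, or None."""
    body = opener.open(url).read()
    if isinstance(body, bytes):
        # Decode leniently so odd page encodings do not abort the check.
        body = body.decode("utf-8", "replace")
    match = IPV4_PATTERN.search(body)
    return match.group(0) if match else None


def compare_reported_ips(opener, urls):
    """Map each URL to the IP address that site reports for this opener."""
    return dict((url, reported_ip(opener, url)) for url in urls)
```

Calling `compare_reported_ips(br, [...])` with the same opener for both sites makes it easy to see whether they report the same address; if they differ, the requests are not all taking the same route.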
What is even more puzzling is that when I use Tor in Firefox, Google also shows a random IP address based in Canada. So my real IP address is exposed only when I send automated requests. Can someone help me figure out what is causing this leak?
I understand crawling is a sensitive topic, but the rate of my crawling is actually slower than a human being!