I am running web crawling jobs on an AWS-hosted server. The crawler scrapes data from an eCommerce website, but recently it has started getting "timeout errors" from the site. The website may be rate-limiting my requests based on my IP address. Allocating a new Elastic IP address solves the problem, but not for long.

My question: Is there a service I can use to automatically and dynamically allocate new IPs and associate them with my instance? Thanks!

xiaolong
  • Did you consider using Tor? – Gabriele Santomaggio Apr 08 '14 at 16:35
  • @Gas Thanks! Does it work if I don't use the Tor browser? My crawler (written in Java) sends HTTP requests directly to the target website instead of invoking a real browser. – xiaolong Apr 08 '14 at 16:48
  • http://docs.aws.amazon.com/cli/latest/reference/ec2/allocate-address.html – Uri Agassi Apr 08 '14 at 17:54
  • @UriAgassi, thanks, I know I can allocate a new Elastic IP in the Admin Console or with a CLI tool. Is there a tool that can do this automatically, or do I need to write my own scripts? thx – xiaolong Apr 08 '14 at 18:02
  • You should write your own scripts. – Uri Agassi Apr 08 '14 at 18:07
  • If the website is blocking you, they are doing so because they think you are causing problems for the site. Your best options are to get permission to crawl the site (preferably with API access), or change your strategy to crawl more slowly. – datasage Apr 08 '14 at 19:42
  • @datasage, good point. API access was my first thought, but the target site doesn't provide an API for data collection. Also, the data needs to be collected within a day or two at the end of every month, so a slow/gentle strategy doesn't work. thx – xiaolong Apr 08 '14 at 19:49

2 Answers

To change the EIP, you can just use Python boto.

Something like this:

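#!/usr/bin/python

import boto.ec2

# Connect to the region that hosts the instance.
conn = boto.ec2.connect_to_region("us-east-1",
    aws_access_key_id='<key>',
    aws_secret_access_key='<secret>')

# Look up the instance by its ID (use the same connection object throughout).
reservations = conn.get_all_instances(filters={'instance-id' : 'i-xxxxxxxx'})
instance = reservations[0].instances[0]

# Remember the current public IP and allocate a new Elastic IP.
old_address = instance.ip_address
new_address = conn.allocate_address().public_ip

# Swap the addresses: detach the old one, then attach the new one.
conn.disassociate_address(old_address)
conn.associate_address('i-xxxxxxxx', new_address)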
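If the script above runs repeatedly, each run leaves the previous Elastic IP allocated to the account, and AWS limits how many an account may hold. A minimal sketch of one way to wrap the swap in a reusable function that also releases the old address follows. It assumes a boto 2 EC2 connection using the EC2-Classic style calls keyed by public IP (VPC addresses are usually handled by allocation ID instead), and the function name `rotate_elastic_ip` and its `old_ip` parameter are hypothetical, not part of boto.

```python
def rotate_elastic_ip(conn, instance_id, old_ip=None):
    """Attach a freshly allocated Elastic IP to an instance and return it.

    Assumptions (not from boto itself): `conn` behaves like a boto 2
    EC2 connection, and `old_ip` is an Elastic IP owned by this account,
    or None when there is nothing to clean up.
    """
    new_ip = conn.allocate_address().public_ip
    try:
        if old_ip:
            conn.disassociate_address(public_ip=old_ip)
        conn.associate_address(instance_id=instance_id, public_ip=new_ip)
    except Exception:
        # The swap failed, so give the unused new address back.
        conn.release_address(public_ip=new_ip)
        raise
    if old_ip:
        # Release the previous address so allocations do not pile up.
        conn.release_address(public_ip=old_ip)
    return new_ip


# Example call, assuming `conn` was created with boto.ec2.connect_to_region:
#   current_ip = rotate_elastic_ip(conn, 'i-xxxxxxxx', old_ip=current_ip)
```

Passing the connection in as an argument keeps the function easy to call from a cron-driven script and easy to exercise with a stand-in object before pointing it at a real account.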
Rico

If you want to use the Tor network, just run:

sudo apt-get install tor 
sudo /etc/init.d/tor start

netstat -ant | grep 9050  # check that Tor is listening on its SOCKS port

Then, in your Java project, set the proxy like this:

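public static void main(String[] args) {
    // Route the JVM's socket connections through the local Tor SOCKS proxy.
    System.setProperty("socksProxyHost", "127.0.0.1");
    System.setProperty("socksProxyPort", "9050");
    // ... start the crawler here ...
}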

You can schedule a cron job that restarts your application and Tor at a fixed interval (every XX).

Easy and secure.
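As an alternative to restarting Tor from cron, a running Tor can be asked for fresh circuits through its control port with `SIGNAL NEWNYM`. This is a different technique from the restart described above, shown only as a rough sketch. It assumes `ControlPort 9051` has been enabled in `torrc` (it is not on by default) with either password authentication or no authentication, and the helper names `newnym_commands` and `request_new_identity` are hypothetical. Tor may rate-limit the signal, and a new circuit does not guarantee a different exit address.

```python
import socket


def newnym_commands(password=""):
    """Build the control-port lines that authenticate and ask for new circuits."""
    # Quoted strings in the control protocol escape backslashes and quotes.
    escaped = password.replace("\\", "\\\\").replace('"', '\\"')
    return ['AUTHENTICATE "%s"' % escaped, "SIGNAL NEWNYM"]


def _read_reply(sock):
    """Read one CRLF-terminated reply line from the control connection."""
    data = b""
    while not data.endswith(b"\r\n"):
        chunk = sock.recv(1024)
        if not chunk:
            break
        data += chunk
    return data.decode("ascii", "replace").strip()


def request_new_identity(host="127.0.0.1", port=9051, password="",
                         connect=socket.create_connection):
    """Send SIGNAL NEWNYM to a running Tor; raise if Tor rejects a command.

    The port and the injectable `connect` callable are assumptions made
    for this sketch.
    """
    sock = connect((host, port))
    try:
        for command in newnym_commands(password):
            sock.sendall((command + "\r\n").encode("ascii"))
            reply = _read_reply(sock)
            # Tor answers "250 OK" when a command is accepted.
            if not reply.startswith("250"):
                raise RuntimeError(
                    "Tor refused %s: %s" % (command.split()[0], reply))
    finally:
        sock.close()
```

A cron entry could run a small script that calls `request_new_identity()` instead of restarting the service, which leaves the Java crawler and its SOCKS settings untouched.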

Gabriele Santomaggio
  • Great, thanks for this nice alternative. It sounds like this approach requires me to modify my Java code and restart Tor. Allocating a new IP is easier to maintain. I'll check this out if that doesn't work well. – xiaolong Apr 08 '14 at 20:29