
I'm currently crawling multiple websites using several threads (the URL connection approach) from a single IP address, and some of the websites have blocked me.

We want to prevent this somehow, which led me to think of our virtual machine, which has multiple IP addresses.

Is there a way in Java to use these local IPs for different URL connections that run in different Java threads?

I have tried using a proxy, but it does not seem to work; I believe that is because the local IPs should not be proxied.

Here is what I tried:

Proxy proxy = new Proxy(Proxy.Type.HTTP, new InetSocketAddress(InetAddress.getByAddress(ip), 8080));

Another solution, from "Define source ip address using Apache HttpClient", does not work because the functions it uses are deprecated.

I would much appreciate any advice from someone who has encountered the same scenario.

Jin

2 Answers


I found a solution using RequestConfig from the latest HttpClient. Here is my code:

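// Needs java.net.InetAddress plus Apache HttpClient 4.3+:
// org.apache.http.client.config.RequestConfig, org.apache.http.client.methods.HttpGet,
// org.apache.http.impl.client.HttpClientBuilder, org.apache.http.client.HttpClient, org.apache.http.HttpResponse
String ipAddress = "xxx.xxx.xxx.xxx"; // the local source IP you intend to send from
RequestConfig config = RequestConfig.custom()
    .setLocalAddress(InetAddress.getByName(ipAddress)) // bind outgoing connections to this local IP
    .build();
HttpClient client = HttpClientBuilder.create().build();
HttpGet getRequest = new HttpGet(address); // address is the target URL
getRequest.setConfig(config);
HttpResponse response = client.execute(getRequest);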

I'm posting this in case other people encounter the same problem.

Many answers on Stack Overflow use the older HttpClient API with the getParams methods, which are now deprecated; RequestConfig should be used instead.
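For anyone not using HttpClient, the same idea (choosing the source IP per connection) can be sketched with plain java.net sockets. This is only a minimal sketch: the class name LocalBindSketch and the helper connectFrom are made up for illustration, and the demo binds to the loopback address so it runs anywhere. On a real machine you would pass one of its own IPs, and each thread could call the helper with a different one.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

class LocalBindSketch {
    /** Opens a TCP connection to host:port whose source address is localIp (hypothetical helper). */
    public static Socket connectFrom(InetAddress localIp, String host, int port) throws IOException {
        Socket socket = new Socket();
        try {
            // Port 0 lets the OS pick a free ephemeral source port on the chosen local IP.
            socket.bind(new InetSocketAddress(localIp, 0));
            socket.connect(new InetSocketAddress(host, port), 5000); // 5 s connect timeout
            return socket;
        } catch (IOException e) {
            socket.close();
            throw e;
        }
    }

    public static void main(String[] args) throws Exception {
        // Demo against a throwaway local server so no external site is contacted.
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket server = new ServerSocket(0, 1, loopback);
             Socket client = connectFrom(loopback, loopback.getHostAddress(), server.getLocalPort());
             Socket accepted = server.accept()) {
            System.out.println("server saw source: " + accepted.getInetAddress().getHostAddress());
        }
    }
}
```

The bind call fails with an exception if the given address is not actually assigned to one of the machine's interfaces, so this only works for IPs the machine really owns.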

Jin

You're not going to be able to get very far. The IP addresses will all have to be valid within your domain, otherwise the routing between your computer and the web server won't work.

So the traffic will be identifiable as having come from one single domain. And if you're behind an IPv4 NAT, all your traffic will appear to come from one single IP address anyway, undoing what you're trying to do. If you're running IPv6, it will still look like all the traffic is coming from the same place. There's nothing you can do that will allow the traffic to appear to come from different domains and still succeed in making a connection. The TCP packets have to route successfully, and that's not going to happen if the return address is not in your domain.
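To see whether this applies to your machine, you could list its addresses and check which ones are private. The following is a rough sketch only: LocalAddressSketch and isPrivateOrLocal are names invented here, and the check is a heuristic built on the standard InetAddress flags (it does not cover every non-public range, for example IPv6 unique local addresses). Private addresses are typically translated by a NAT before they reach the website, so several of them can still show up as one public IP.

```java
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;

class LocalAddressSketch {
    /** Heuristic: true for loopback, link-local and site-local (e.g. 10/8, 172.16/12, 192.168/16) addresses. */
    public static boolean isPrivateOrLocal(InetAddress addr) {
        return addr.isLoopbackAddress() || addr.isLinkLocalAddress() || addr.isSiteLocalAddress();
    }

    /** Collects the addresses assigned to all interfaces that are currently up. */
    public static List<InetAddress> localAddresses() throws SocketException {
        List<InetAddress> result = new ArrayList<>();
        Enumeration<NetworkInterface> nics = NetworkInterface.getNetworkInterfaces();
        if (nics == null) {
            return result; // no interfaces reported
        }
        for (NetworkInterface nic : Collections.list(nics)) {
            if (nic.isUp()) {
                result.addAll(Collections.list(nic.getInetAddresses()));
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        for (InetAddress addr : localAddresses()) {
            String kind = isPrivateOrLocal(addr) ? "private/local" : "not flagged as private";
            System.out.println(addr.getHostAddress() + "  (" + kind + ")");
        }
    }
}
```

If every address printed is private, the website only ever sees whatever public address your gateway uses, regardless of which local IP a thread binds to.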

It's unsurprising that some websites blocked your requests: too many connection attempts from one place look a bit like a DoS attack, which is distinctly unfriendly. Your best option is to contact the website owners and ask permission. Given that traffic costs money, they will quite rightly want to know what's in it for them.

bazza