
I need to make many requests to one URL, but after ~20 requests I get a 429 Too Many Requests response. So my plan was to route the requests through proxies. I have tried three things:

  • Tor-proxy using python
  • Free proxy lists
  • ScraperApi

But all of them (even the ScraperApi trial) are unbelievably slow, around 5-10 seconds per request. An example looks like this:

import requests

url = "https://httpbin.org/ip"
# The scheme prefix matters: without "http://" in the proxy address,
# requests may not route the call through the proxy as intended.
proxies = {"https": "http://164.155.149.1:80"}
r = requests.get(url, proxies=proxies)
print(r.text)

The proxy IP was from some free proxy website. Sure, a proxy is an extra hop in between, but I was hoping to find proxies that take at most one second per request.
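One way to at least weed out the slowest proxies is to time a test request through each candidate and keep only those that answer within a second. A minimal sketch (the candidate list below is a placeholder for whatever proxy source you use):

import requests

candidates = [
    "http://164.155.149.1:80",
    # ... more proxies from your source
]

fast_proxies = []
for proxy in candidates:
    try:
        # timeout=1 discards any proxy that cannot answer within a second
        r = requests.get("https://httpbin.org/ip",
                         proxies={"http": proxy, "https": proxy},
                         timeout=1)
        if r.ok:
            fast_proxies.append(proxy)
    except requests.RequestException:
        pass  # dead or too slow, skip it

print(fast_proxies)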

Is there any way to solve this issue?

Thanks in advance

codedor
  • Try to find out the exact number of requests you can make in some time window (1 min or 10 min) without getting banned, and also build a pool of user-agents and rotate them on each request. You need more time to make more requests, there is no magic. – MeT Jun 10 '22 at 15:24
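To make MeT's suggestion concrete, here is a minimal sketch of per-request user-agent rotation with throttling (the user-agent strings and delay values are placeholders, not measured limits):

import random
import time

import requests

url = "https://httpbin.org/ip"

# Placeholder pool; use real, full browser user-agent strings here.
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (X11; Linux x86_64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
]

for _ in range(100):
    headers = {"User-Agent": random.choice(user_agents)}
    r = requests.get(url, headers=headers, timeout=10)
    if r.status_code == 429:
        time.sleep(60)   # back off when the server rate-limits
        continue
    print(r.text)
    time.sleep(3)        # stay under the limit you measured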

1 Answer


Codedor, one way I can think of is:

  • Create a pool of EC2 instances on AWS (or any other cloud service provider of your choice). These can be the cheapest ones - even spot instances on AWS.
  • Round-robin your requests across these VMs. Since each VM has its own public IP, you are less likely to hit "429 Too Many Requests" as quickly. The more instances you have, the less likely it becomes.

E.g.:

  • Say you have 10 VMs.
  • On each VM you make 1 request/5s = 12 requests/min.
  • Altogether you will make 12 × 10 = 120 requests/min.
  • Add reasonable delays.

Distributing the jobs across the VMs would be a little trickier - but doable. You can have a master node running a Python script that iterates through the VMs and spawns the request command on them. You could use various libraries to execute a command on a remote machine from Python - like paramiko, or subprocess/os wrapping ssh. A sketch follows below.
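For instance, a minimal round-robin sketch with paramiko (the host IPs, SSH username, key path, and remote script name are all placeholders):

import itertools
import os

import paramiko

# Placeholders: your VM IPs and the fetch script already present on each VM.
vm_hosts = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
remote_cmd = "python3 fetch_one.py"

def run_on(host, cmd):
    """Run `cmd` on `host` over SSH and return its stdout."""
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(host, username="ubuntu",
                   key_filename=os.path.expanduser("~/.ssh/key.pem"))
    try:
        _, stdout, _ = client.exec_command(cmd)
        return stdout.read().decode()
    finally:
        client.close()

# Round-robin 120 requests over the pool of VMs.
for _, host in zip(range(120), itertools.cycle(vm_hosts)):
    print(run_on(host, remote_cmd))

Opening a fresh SSH connection per request adds its own latency; in practice you would keep one connection per VM open and reuse it.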

Loner