I am using a fairly cookie-cutter piece of code to asynchronously request the HTML from a few hundred URLs that I scraped with another script. The code works perfectly.
Unfortunately, this is causing my IP to be blocked due to the high number of requests.
My thought is to write some code to grab some proxy IP addresses, place them in a list, and cycle through them randomly as the requests are sent. Assuming I have no problem creating this list, I am having trouble conceptualising how to work the random rotation of these proxy IPs into my asynchronous request code. This is my code so far:
import asyncio
import aiohttp

async def download_file(url):
    # Request one URL and return the raw response body
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            content = await resp.read()
            return content

async def write_file(n, content):
    # Write the downloaded HTML to a numbered file
    filename = f'sync_{n}.html'
    with open(filename, 'wb') as f:
        f.write(content)

async def scrape_task(n, url):
    content = await download_file(url)
    await write_file(n, content)

async def main():
    tasks = []
    for n, url in enumerate(open('links.txt').readlines()):
        tasks.append(scrape_task(n, url))
    await asyncio.wait(tasks)

if __name__ == '__main__':
    asyncio.run(main())
I am thinking that I need to put:
conn = aiohttp.TCPConnector(local_addr=(x, 0), loop=loop)
async with aiohttp.ClientSession(connector=conn) as session:
    ...
as the second and third lines of my code, where x would be one of the random IP addresses from a previously defined list. How would I go about doing this? I am unsure whether placing the whole thing in a simple synchronous loop would defeat the purpose of using asynchronous requests.
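In case it helps show what I am picturing, here is a rough, untested sketch of the rotation using aiohttp's per-request proxy argument instead of a connector (the proxy addresses are placeholders, and I have left out the file writing):

import asyncio
import random
import aiohttp

# Placeholder proxies; in reality these would come from my proxy-scraping code
PROXIES = [
    'http://111.111.111.111:8080',
    'http://222.222.222.222:3128',
]

async def download_file(session, url):
    # Pick a random proxy for this one request
    proxy = random.choice(PROXIES)
    async with session.get(url, proxy=proxy) as resp:
        return await resp.read()

async def main():
    urls = [line.strip() for line in open('links.txt')]
    async with aiohttp.ClientSession() as session:
        contents = await asyncio.gather(
            *(download_file(session, url) for url in urls)
        )

if __name__ == '__main__':
    asyncio.run(main())

Is something along these lines the right idea, or do I really need the TCPConnector approach above?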
Please note I am very new to coding. If there is a simpler solution to the problem of being blocked by a website for rapid-fire requests, that would be very helpful too.
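For example, one idea I had was to simply throttle how many requests are in flight at once with a semaphore; a rough, untested sketch of what I mean (the limit of 5 and the one-second pause are just guesses):

import asyncio
import aiohttp

# Guessing at a limit; the right number presumably depends on the site
semaphore = asyncio.Semaphore(5)

async def download_file(url):
    async with semaphore:  # only a few requests allowed at a time
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as resp:
                content = await resp.read()
        await asyncio.sleep(1)  # small pause before freeing the slot
        return content

Would something like this be enough to avoid the block, or are proxies the only realistic option?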