
I am using relatively cookie-cutter code to asynchronously request the HTML of a few hundred URLs that I scraped with another piece of code. The code works perfectly.

Unfortunately, this is causing my IP to be blocked due to the high number of requests.

My thought is to write some code that grabs some proxy IP addresses, places them in a list, and cycles through them randomly as the requests are sent. Assuming I have no problems creating this list, I am having trouble conceptualising how to splice the random rotation of these proxy IPs into my asynchronous request code. This is my code so far:

import asyncio
import aiohttp

async def download_file(url):
    # Fetch the raw response body for a single URL.
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            content = await resp.read()
            return content

async def write_file(n, content):
    # Write the downloaded bytes to a numbered HTML file (plain blocking file I/O).
    filename = f'sync_{n}.html'
    with open(filename, 'wb') as f:
        f.write(content)

async def scrape_task(n, url):
    content = await download_file(url)
    await write_file(n, content)

async def main():
    tasks = []
    # Build one task per line of links.txt, stripping the trailing newline from each URL.
    for n, url in enumerate(open('links.txt').readlines()):
        tasks.append(scrape_task(n, url.strip()))
    await asyncio.wait(tasks)

if __name__ == '__main__':
    asyncio.run(main())

I am thinking that I need to put:

conn = aiohttp.TCPConnector(local_addr=(x, 0), loop=loop)
async with aiohttp.ClientSession(connector=conn) as session:
    ...

as the second and third lines of my code, where x would be one of the random IP addresses from a list defined earlier. How would I go about doing this? I am unsure whether wrapping the whole thing in a simple synchronous loop would defeat the purpose of using asynchronous requests.
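
From what I can tell, aiohttp's `session.get()` also accepts a `proxy` argument, so another option might be to pick a random proxy per request instead of binding to a local address. A rough sketch of what I mean (the `proxies` list and its addresses are just placeholders for the list I would build):

import random
import aiohttp

# Placeholder proxy list -- in practice this would come from the proxy-scraping code.
proxies = ['http://203.0.113.1:8080', 'http://203.0.113.2:3128']

async def download_file(url):
    proxy = random.choice(proxies)  # choose a different proxy for each request
    async with aiohttp.ClientSession() as session:
        # aiohttp can route an individual request through a proxy
        # via the `proxy` keyword argument of session.get().
        async with session.get(url, proxy=proxy) as resp:
            return await resp.read()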

If there is a simpler solution to the problem of being blocked from a website for rapid-fire requests, that would be very helpful too. Please note I am very new to coding.

TNoms
  • Using `local_addr` to spoof your IP doesn't sound like it will actually work. The purpose of `local_addr` is to bind the socket to a specific address on a machine with multiple network interfaces. To use proxy servers, you should tell aiohttp to request the URL via proxy by setting up the `http_proxy` environment variable and specifying `trust_env=True` when creating the session. Maybe the server is banning you because you are sending all your requests at once. Have you tried limiting the number of requests or adding some pause between them? – user4815162342 Jun 19 '20 at 09:08
  • Yes I think you are right by suggesting that the server is banning me because of sending all the requests all at once. When I was sending the requests synchronously I wasn't getting banned, but ideally I need the responses as quick as possible. I may try your first suggestion if you had any more feedback about how to best implement that? – TNoms Jun 19 '20 at 09:54
  • It's not something you can implement without having access to proxy servers (or setting them up yourself). In your place I'd first try to speed up scraping without going to the extreme of getting everything at once. Have you tried fetching 3 or 5 items at once, for example? – user4815162342 Jun 19 '20 at 10:28
  • Also, if your addresses point to different sites, have you tried making it so that you don't hammer the same site at once? – user4815162342 Jun 19 '20 at 10:36
  • They are all from the same site, unfortunately. I was thinking of just scraping some IP server proxies from https://free-proxy-list.net/ first, and then using them. Maybe I am not understanding what you are saying. I have not tried fetching a few items and then waiting. I will see if that works. (Or at least try to figure out how to do it!) – TNoms Jun 19 '20 at 10:45
  • You don't even have to wait, just try to fetch no more than a couple of items at the same time instead of fetching them **all** at once. See e.g. [this answer](https://stackoverflow.com/a/61478547/1600898). – user4815162342 Jun 19 '20 at 12:33
  • Thank you for this guidance. I will need to stare at that answer for a while before I could come close to implementing it in my code though - not sure where to start. – TNoms Jun 19 '20 at 13:10
  • It's intended as a drop-in replacement for `gather`, so you can use it immediately. Just replace `await asyncio.wait(tasks)` with `await gather_with_concurrency(1, *tasks)`. That should work exactly the same as the sequential code. Replace 1 with 2 and your code should work twice as fast, but work with 2 parallel connections at all times, loading the server twice as much. Change 2 to 3 and the load will be 3x the sequential one, and so on. If you just use `gather` (or `wait`, as you did), you are sending **all** the requests at once, and it's understandable that the server rejects that. – user4815162342 Jun 19 '20 at 13:37 (a sketch of this helper appears below the comments)
  • Ah this makes sense now. I will try it out for myself tomorrow. Thank you very much for your help (and your patience!) – TNoms Jun 19 '20 at 13:45
  • Good luck. Please note that this kind of guidance through comments is not the norm for StackOverflow - you're supposed to post a well-defined problem which has a clear answer, and comments should serve to request clarification and result in an update to the question. If you confirm that the proposed solution works for you, I'll post an answer or mark your question as duplicate of the previous one whose answer I linked. – user4815162342 Jun 19 '20 at 14:10
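
For reference, the `gather_with_concurrency` helper referred to in the comments (from the linked answer) is essentially a semaphore wrapper around `asyncio.gather`. A minimal sketch, with the concurrency limit of 2 used purely as an example:

import asyncio

async def gather_with_concurrency(n, *tasks):
    # Allow at most `n` of the supplied coroutines to run at the same time.
    semaphore = asyncio.Semaphore(n)

    async def sem_task(task):
        async with semaphore:
            return await task

    return await asyncio.gather(*(sem_task(t) for t in tasks))

# In main(), replace `await asyncio.wait(tasks)` with:
#     await gather_with_concurrency(2, *tasks)

The semaphore keeps only n downloads in flight at any moment, so the server sees a steady stream of requests rather than all of them at once.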

0 Answers