
I'm a web scraping enthusiast who just learned about using asyncio and proxies to improve scraping performance. The program I wrote downloads a large number of sites pretty fast, but after reviewing it I found two things I don't like at all: the running tasks can't change their proxy if they get a bad response in the middle of the execution, and the way I keep only m tasks running simultaneously is clumsy.

Here is the code.

import asyncio
import os

import httpx

async def get_response(client, url, t):
    headers = {...}

    resp = await client.get(url, headers=headers, timeout=30)

    if resp.status_code == httpx.codes.OK:
        return [resp.text, str(resp.url)[108:113]]
    else:
        #In case the response isn't 200 the function calls itself again
        #until it receives a 200
        print(f'Error: {resp.status_code}')
        await asyncio.sleep(t)
        return await get_response(client, url, t + 5)

async def main():
    proxies = getProxyList() #Get the proxies, currently they are 10
    files = os.listdir('results/') #Get the current downloaded files
    #Get the list of urls
    with open('urls.txt','r') as f:
        urls = [line.strip() for line in f.readlines()]
    
    n = 1 #Index to rotate the proxy after every call
    m = 10 #Number of concurrent calls
    t = 5 #Time the get_response waits in case it didn't work the first time
    results = [] #Stores the results of the tasks
    #The for loop advances in steps of m so every iteration launches m tasks at once
    for i in range(0,len(urls),m):
        async with httpx.AsyncClient(proxies = {'http://':proxies[n],'https://':proxies[n]}) as client:
            tasks = []
            for j in range(0,m):
                #Stop when the urls run out and skip files that are already downloaded
                if i + j < len(urls) and "result" + f"{i:0>5}" + ".json" not in files:
                    tasks.append(asyncio.create_task(get_response(client, urls[i+j], t)))
            results = await asyncio.gather(*tasks)
        
        #After the calls are back write the results in each file
        if len(results)>0:
            for k in results:   
                with open('results/result'+k[1]+'.json','w+') as f:
                    f.write(k[0])
            print(i)
        n += 1
        #10 is the number of proxies I currently have
        #and every m tasks they rotate
        if n == 10: 
            n=0

First, the proxies. Currently I run m tasks with one proxy IP and, after they finish, switch to the next proxy in the list. Apart from being ugly code in my opinion, I was wondering if there is any way to change the client's proxy once it has been created, i.e. once the AsyncClient has started a task, change the proxy during get_response(client, url, t). Can this be done?
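
In case it helps to see what I mean, the best workaround I could come up with is not to change the proxy of a live client at all, but to keep one AsyncClient per proxy and have the fetch switch to another client when a request fails. This is only an untested sketch: fetch_with_rotation, rotation_demo and max_tries are made-up names, and I'm assuming getProxyList() returns full proxy URLs like 'http://host:port'.

import asyncio
import itertools

import httpx

async def fetch_with_rotation(clients, url, t, max_tries=5):
    #Try the url with one client per proxy, moving to the next client
    #after a bad response or a connection error
    client_cycle = itertools.cycle(clients)
    for attempt in range(max_tries):
        client = next(client_cycle)
        try:
            resp = await client.get(url, timeout=30)
            if resp.status_code == httpx.codes.OK:
                return resp.text
            print(f'Error: {resp.status_code}, trying the next proxy')
        except httpx.HTTPError as e:
            print(f'Request failed: {e}')
        await asyncio.sleep(t)
    return None

async def rotation_demo(urls, t=5):
    proxies = getProxyList() #Same helper as in main(), one url per proxy
    clients = [httpx.AsyncClient(proxies={'http://': p, 'https://': p}) for p in proxies]
    try:
        return await asyncio.gather(*(fetch_with_rotation(clients, u, t) for u in urls))
    finally:
        await asyncio.gather(*(c.aclose() for c in clients))

If there is a proper way to change the proxy on the client itself, I'd prefer that over managing a pool like this.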

About controlling the number of running tasks: when I first wrote the program I just ran the loop as for i in range(0,len(urls)), but that added every task to the tasks list and then ran them all at the same time when they were gathered, which of course led to the site banning me and the proxies I was using because it tried to make around 100,000 calls at once.

I couldn't find anything about how to limit the number of tasks running at the same time, or how to run them with some kind of control over them.

Also, the get_response(client, url, t) function uses recursion, which I'm not sure is a good idea. I'm proud of it, but it doesn't seem ideal either.
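
If recursion is a bad idea, the alternative I'm picturing is a plain retry loop with a cap on attempts, roughly like this (untested; get_response_loop and max_tries are made-up names, the empty headers dict stands in for my real headers, and it reuses the asyncio and httpx imports from above):

async def get_response_loop(client, url, t, max_tries=5):
    #Same as get_response but retries in a loop with an upper bound
    #instead of the function calling itself
    headers = {} #Stand-in for the real headers
    for attempt in range(max_tries):
        resp = await client.get(url, headers=headers, timeout=30)
        if resp.status_code == httpx.codes.OK:
            return [resp.text, str(resp.url)[108:113]]
        print(f'Error: {resp.status_code}')
        await asyncio.sleep(t)
        t += 5 #Wait a bit longer before the next retry
    return None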

Lastly, if someone could give me any feedback on how to improve the structure, or what I could add to write a better program, it would be highly appreciated.

  • Semaphores do this. This [question](https://stackoverflow.com/questions/40836800/python-asyncio-semaphore-in-async-await-function) addresses a similar issue. – jwal Aug 07 '23 at 09:17
  • Didn't know about semaphores, I'll try to implement it. Thanks it seems to be what I was looking for. – sushiwithoutsushi Aug 09 '23 at 12:02
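
Edit: following the semaphore suggestion in the comments, this is the kind of thing I'm planning to try, based on the linked question. It's untested; limited_fetch and main_with_semaphore are just made-up names wrapping the get_response above, and it still uses a single proxy for the whole client.

async def limited_fetch(sem, client, url, t):
    #The semaphore lets only m coroutines past this point at a time;
    #the rest wait here instead of all running at once
    async with sem:
        return await get_response(client, url, t)

async def main_with_semaphore(urls, proxy, m=10, t=5):
    sem = asyncio.Semaphore(m)
    async with httpx.AsyncClient(proxies={'http://': proxy, 'https://': proxy}) as client:
        results = await asyncio.gather(*(limited_fetch(sem, client, url, t) for url in urls))
    return results

I still need to combine this with the proxy rotation somehow.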

0 Answers