
I am running a Python script that scrapes through a list of links, but the process is too slow. I want to divide the work into several processes so that the script scrapes multiple links at once.

The list contains almost 5,000 links.

Here is my code, which I want to run in parallel:

import requests

# links contains the list of URLs to scrape
def fun():
    for link in links:
        requests.get(link, timeout=5)
        ###... scraping code
        #####
ti7

2 Answers


If you want to make several requests at the same time, you don't want to use requests, but AIOHTTP instead.

The package allows you to make HTTP requests asynchronously.
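A minimal sketch of how that could look, assuming aiohttp and asyncio (the `fetch`/`main` names, the placeholder URL list, and the concurrency limit of 20 are illustrative, not from the original answer); a semaphore caps how many requests are in flight so all 5,000 aren't awaited at once:

import asyncio
import aiohttp

links = ["https://example.com"]  # replace with the list of ~5000 links

async def fetch(session, sem, url):
    # the semaphore limits how many requests run concurrently
    async with sem:
        try:
            async with session.get(url, timeout=aiohttp.ClientTimeout(total=5)) as resp:
                return await resp.text()
        except (aiohttp.ClientError, asyncio.TimeoutError):
            return None  # failed or timed-out links are skipped

async def main():
    sem = asyncio.Semaphore(20)  # at most 20 requests in flight at once
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, sem, url) for url in links]
        pages = await asyncio.gather(*tasks)
    # ... scraping code over the collected pages goes here
    return pages

asyncio.run(main())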

FLAK-ZOSO
    could you provide an example of how to use AIOHTTP in this case? (for example, it probably doesn't make much sense for them to `await` each request in a loop, nor to `asyncio.gather()` all 5000!) – ti7 May 20 '22 at 05:37
  • Can you please provide an example of AIOHTTP? – Soham Chakraborty May 21 '22 at 06:15

I suggest building a multithreaded program to make the requests. concurrent.futures is one of the easiest ways to multithread these kinds of requests, in particular using the ThreadPoolExecutor. There is even a simple multithreaded URL-request example in the documentation.

Here is some sample code using bs4 and concurrent.futures:


import time
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor, as_completed

URLs = [ ... ] # A long list of URLs.

def parse(url):
    r = requests.get(url, timeout=5)
    soup = BeautifulSoup(r.content, 'lxml')
    return soup.find_all('a')

# Run 10 workers concurrently; since this work is I/O-bound, the right number
# depends more on the network and the target server than on your CPU cores.
with ThreadPoolExecutor(max_workers=10) as executor:
    start = time.time()
    futures = [ executor.submit(parse, url) for url in URLs ]
    results = []
    for future in as_completed(futures):
        results.append(future.result())  # .result() unwraps the Future (and re-raises any exception)
    end = time.time()
    print("Time Taken: {:.6f}s".format(end - start))

Also, you may want to check out the Python Scrapy framework. It scrapes data concurrently and is easy to learn, and it comes with many features such as auto-throttle, rotating proxies and user agents; you can easily integrate it with your databases as well. A rough spider sketch follows.
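If you go that route, a minimal spider might look roughly like this (the spider name, the placeholder start URL, and the settings values are illustrative assumptions, not part of the original answer):

import scrapy

class LinkSpider(scrapy.Spider):
    name = "links"
    start_urls = ["https://example.com"]  # replace with the list of ~5000 links

    custom_settings = {
        "CONCURRENT_REQUESTS": 16,     # how many requests Scrapy keeps in flight
        "AUTOTHROTTLE_ENABLED": True,  # back off automatically if the server slows down
    }

    def parse(self, response):
        # collect every anchor href on the page
        for href in response.css("a::attr(href)").getall():
            yield {"link": href}

Run it with `scrapy runspider spider.py -o links.json` and Scrapy handles the concurrency, retries and throttling for you.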

ahmedshahriar