
I am running a Python script that scrapes through a list of links, but the process is too slow. I want to divide the work into several processes so that the script scrapes multiple links at once.

The list contains almost 5,000 links.

Here is my code, which I want to run in parallel:

import requests

# links contains the list of URLs to scrape
def fun():
    for link in links:
        requests.get(link, timeout=5)
        ###... scraping code
        #####
ti7

2 Answers


If you want to make several requests at the same time, you don't want to use requests, but AIOHTTP instead.

The package allows you to make HTTP requests asynchronously.
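A minimal sketch of how that could look, assuming aiohttp and asyncio (the `fetch`/`main` names, the placeholder URL list, and the concurrency limit of 20 are illustrative, not from the original answer); a semaphore caps how many requests are in flight so all 5,000 aren't awaited at once:

import asyncio
import aiohttp

links = ["https://example.com"]  # replace with the list of ~5000 links

async def fetch(session, sem, url):
    # the semaphore limits how many requests run concurrently
    async with sem:
        try:
            async with session.get(url, timeout=aiohttp.ClientTimeout(total=5)) as resp:
                return await resp.text()
        except (aiohttp.ClientError, asyncio.TimeoutError):
            return None  # failed or timed-out links are skipped

async def main():
    sem = asyncio.Semaphore(20)  # at most 20 requests in flight at once
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, sem, url) for url in links]
        pages = await asyncio.gather(*tasks)
    # ... scraping code over the collected pages goes here
    return pages

asyncio.run(main())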

FLAK-ZOSO
    could you provide an example of how to use AIOHTTP in this case? (for example, it probably doesn't make much sense for them to `await` each request in a loop, nor to `asyncio.gather()` all 5000!) – ti7 May 20 '22 at 05:37
  • Can you please provide an example of AIOHTTP? – Soham Chakraborty May 21 '22 at 06:15

I suggest building a multithreaded program to make the requests. concurrent.futures is one of the easiest ways to multithread these kinds of requests, in particular using the ThreadPoolExecutor. There is even a simple multithreaded URL-request example in the documentation.

Here is some sample code using bs4 and concurrent.futures:


import time
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor, as_completed

URLs = [ ... ] # A long list of URLs.

def parse(url):
    r = requests.get(url, timeout=5)
    soup = BeautifulSoup(r.content, 'lxml')
    return soup.find_all('a')

# Run 10 workers concurrently; since this work is I/O-bound, the right number
# depends more on the network and the target server than on your CPU cores.
with ThreadPoolExecutor(max_workers=10) as executor:
    start = time.time()
    futures = [ executor.submit(parse, url) for url in URLs ]
    results = []
    for future in as_completed(futures):
        results.append(future.result())  # .result() unwraps the Future (and re-raises any exception)
    end = time.time()
    print("Time Taken: {:.6f}s".format(end - start))

Also, you may want to check out the Python Scrapy framework. It scrapes data concurrently and is easy to learn, and it comes with many features such as auto-throttle, rotating proxies and user agents; you can easily integrate it with your databases as well. A rough spider sketch follows.
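If you go that route, a minimal spider might look roughly like this (the spider name, the placeholder start URL, and the settings values are illustrative assumptions, not part of the original answer):

import scrapy

class LinkSpider(scrapy.Spider):
    name = "links"
    start_urls = ["https://example.com"]  # replace with the list of ~5000 links

    custom_settings = {
        "CONCURRENT_REQUESTS": 16,     # how many requests Scrapy keeps in flight
        "AUTOTHROTTLE_ENABLED": True,  # back off automatically if the server slows down
    }

    def parse(self, response):
        # collect every anchor href on the page
        for href in response.css("a::attr(href)").getall():
            yield {"link": href}

Run it with `scrapy runspider spider.py -o links.json` and Scrapy handles the concurrency, retries and throttling for you.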

ahmedshahriar