
I am using the requests module to download content from many URLs; my code looks something like this:

import requests

base_url = "https://example.com/images/"  # placeholder; the real base URL is constant
for i in range(1000):
    url = base_url + f"{i}.anything"
    r = requests.get(url)

Of course this is simplified, but the base URL is always the same; I only want to download an image, for example. This takes very long because of the sheer number of iterations. The internet connection is not the problem; it is rather the time it takes to start each request, and so on. So I was thinking about something like multiprocessing, because the task is always the same and I could imagine it being a lot faster when run in parallel.
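For reference, here is the same simplified loop with a shared requests.Session, which as far as I understand reuses the underlying TCP connection and should cut some of the per-request startup cost, though it still runs one request at a time (base_url is a placeholder):

import requests

base_url = "https://example.com/images/"  # placeholder base URL

session = requests.Session()  # reuses the TCP connection across requests
for i in range(1000):
    url = base_url + f"{i}.anything"
    r = session.get(url)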

Is this somehow doable? Thanks in advance!

Luca Tatas
  • This seems like a pretty generic multiprocessing question. Try starting with the following answers. https://stackoverflow.com/a/28463266/15013600 https://stackoverflow.com/a/26521507/15013600 – jisrael18 May 04 '21 at 16:18

1 Answer


I would suggest that in this case lightweight threads are the better fit: the work is I/O-bound, so each worker spends most of its time waiting on the network (during which the GIL is released), and threads avoid the startup cost of spawning processes. When I ran the request against the same URL 5 times, the results were:

Threads: Finished in 0.24 second(s)
MultiProcess: Finished in 0.77 second(s)

Your implementation can be something like this:

import concurrent.futures
import requests
from bs4 import BeautifulSoup
import time

def access_url(url, no):
    print(f"{no}:==> {url}")
    response = requests.get(url)
    soup = BeautifulSoup(response.text, features='lxml')  # requires the lxml parser; 'html.parser' also works
    # crude slice that strips the leading "<title>" (7 characters) from str(soup.title)
    return f"{no} :  {str(soup.title)[7:50]}"

if __name__ == "__main__":
    test_url = "http://example.com/"  # placeholder; use your own base URL
    base_url = test_url
    THREAD_MULTI_PROCESSING = True
    start = time.perf_counter()  # start the timer
    url_list = [base_url for i in range(5)]  # parameters as lists so map() can be used
    url_counter = [i for i in range(5)]      # same length as url_list
    if THREAD_MULTI_PROCESSING:
        with concurrent.futures.ThreadPoolExecutor() as executor:  # threads suit this I/O-bound case
            results = executor.map(access_url, url_list, url_counter)
        for result in results:
            print(result)
    end = time.perf_counter()  # stop the timer
    print(f'Threads: Finished in {round(end - start, 2)} second(s)')

    start = time.perf_counter()
    PROCESS_MULTI_PROCESSING = True
    if PROCESS_MULTI_PROCESSING:
        with concurrent.futures.ProcessPoolExecutor() as executor:
            results = executor.map(access_url, url_list, url_counter)
        for result in results:
            print(result)
    end = time.perf_counter()
    print(f'MultiProcess: Finished in {round(end - start, 2)} second(s)')

I think you will see better performance in your case.
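Adapted to the loop in your question, a minimal sketch could look like this (base_url, the .anything suffix, and the output file names are placeholders taken from your question; max_workers is just a starting point to tune):

import concurrent.futures
import requests

base_url = "https://example.com/images/"  # placeholder, as in the question

def download(i):
    url = base_url + f"{i}.anything"
    r = requests.get(url)
    with open(f"{i}.anything", "wb") as f:  # binary mode for raw bytes, e.g. an image
        f.write(r.content)
    return i

if __name__ == "__main__":
    with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
        for i in executor.map(download, range(1000)):
            print(f"downloaded {i}")

executor.map yields results in submission order, so the prints come out in sequence even though the downloads overlap.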

simpleApp