
I want to scrape multiple URLs and parse them as quickly as possible, but the for loop is not fast enough for me. Is there a way to speed this up, maybe with asynchronous code, multiprocessing, or multithreading?

import grequests
from bs4 import BeautifulSoup


links1 = []  # multiple links


while True:
    try:
        # Send the requests concurrently, up to 25 at a time
        reqs = (grequests.get(link) for link in links1)
        resp = grequests.imap(reqs, size=25, stream=False)

        for r in resp:  # I want this loop to run as fast as possible. Is that possible?
            soup = BeautifulSoup(r.text, 'lxml')
            parse = soup.find('div', class_='txt')
    except Exception as exc:  # the original post omits the handler; this one is a placeholder
        print(exc)

JONH

1 Answer


Here is an example of how to use multiprocessing with requests/BeautifulSoup:

import requests
from tqdm import tqdm  # for pretty progress bar
from bs4 import BeautifulSoup
from multiprocessing import Pool

# some 1000 links to analyze
links1 = [
    "https://en.wikipedia.org/wiki/2021_Moroccan_general_election",
    "https://en.wikipedia.org/wiki/Tangerang_prison_fire",
    "https://en.wikipedia.org/wiki/COVID-19_pandemic",
    "https://en.wikipedia.org/wiki/Yolanda_Fern%C3%A1ndez_de_Cofi%C3%B1o",
] * 250


def parse(url):
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    return soup.select_one("h1").get_text(strip=True)


if __name__ == "__main__":
    with Pool() as p:
        out = []
        for r in tqdm(p.imap(parse, links1), total=len(links1)):
            out.append(r)

    print(len(out))

With my internet connection/CPU (Ryzen 3700x) I was able to get results from all 1000 links in 30 seconds:

100%|██████████| 1000/1000 [00:30<00:00, 33.12it/s]
1000

All my CPU cores were utilized (screenshot from htop):

[Image: htop screenshot]
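
The question also mentions multithreading. Since much of the time in a job like this is usually spent waiting on the network, a thread pool may be worth trying as well. The following is only a minimal sketch under that assumption, not a benchmarked alternative: `scrape`, its `fetch`/`parse` parameters, and the `max_workers=25` default are hypothetical names and values chosen for illustration, and the demo uses in-memory stand-ins so it runs offline.

```python
from concurrent.futures import ThreadPoolExecutor


def scrape(urls, fetch, parse, max_workers=25):
    """Download pages concurrently in threads, then parse each one.

    fetch(url) returns the page body and parse(body) returns the extracted
    value; both are supplied by the caller (hypothetical parameters).
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in the same order as urls.
        bodies = pool.map(fetch, urls)
        # Parsing runs in the main thread while later downloads continue.
        return [parse(body) for body in bodies]


if __name__ == "__main__":
    # Stand-ins so the sketch runs without network access.
    fake_site = {"a": "<h1>First</h1>", "b": "<h1>Second</h1>"}
    titles = scrape(
        ["a", "b"],
        fetch=fake_site.__getitem__,
        parse=lambda html: html[4:-5],  # strip the <h1>...</h1> wrapper
    )
    print(titles)
```

With the libraries used above, `fetch` could be something like `lambda u: requests.get(u).content`, and `parse` could wrap the `BeautifulSoup(...).select_one("h1")` lookup. If parsing itself turns out to be the bottleneck, the process pool shown earlier is likely the better fit, because threads do not run CPU-bound Python code in parallel.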

Andrej Kesely
  • I executed the same code, but it spawns a bunch of errors and they keep spawning. – JONH Sep 12 '21 at 09:32
  • @JONH Which kind of errors? Are you using the exact code with the same links as in my code? – Andrej Kesely Sep 12 '21 at 09:38
  • Yes: "An attempt has been made to start a new process before the current process has finished its bootstrapping phase. This probably means that you are not using fork to start your child processes and you have forgotten to use the proper idiom in the main module:" – JONH Sep 12 '21 at 10:34
  • @JONH Try this: https://stackoverflow.com/questions/55057957/an-attempt-has-been-made-to-start-a-new-process-before-the-current-process-has-f I've updated my answer. – Andrej Kesely Sep 12 '21 at 10:37
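
The error quoted in the comments above is what multiprocessing raises when a pool is created while a child process is still importing the main module. This happens under the spawn start method, which is the default on Windows and macOS. Below is a minimal sketch of the idiom the message refers to. `word_count` and `main` are hypothetical stand-ins for the real worker and driver code, and the input strings are placeholders.

```python
from multiprocessing import Pool, freeze_support


def word_count(text):
    # Stand-in for a worker such as parse(url). It is defined at module top
    # level so that child processes can import it.
    return len(text.split())


def main():
    pages = ["one two three", "four five", "six"]  # placeholder inputs
    with Pool(processes=2) as pool:
        # map() returns results in input order.
        print(pool.map(word_count, pages))


if __name__ == "__main__":
    # Child processes re-import this module under spawn; the guard keeps them
    # from creating pools of their own.
    freeze_support()  # needed only for frozen Windows executables
    main()
```

Anything that creates a `Pool` or starts a `Process` should sit behind the guard. Function definitions and constants can stay at module level.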