How to parse the response from Grequests faster?

Question

I want to scrape multiple URLs and parse the responses as quickly as possible, but the for loop is too slow for me. Is there a way to speed this up, maybe with asynchronous code, multiprocessing, or multithreading?

import grequests
from bs4 import BeautifulSoup

links1 = []  # multiple links

while True:
    try:
        # Send the requests concurrently, at most 25 in flight at a time
        reqs = (grequests.get(link) for link in links1)
        resp = grequests.imap(reqs, size=25, stream=False)

        # I want this for loop to run as quickly as possible. Is that possible?
        for r in resp:
            soup = BeautifulSoup(r.text, 'lxml')
            parse = soup.find('div', class_='txt')
    except Exception as exc:  # the posted snippet was cut off before its except clause
        print(exc)

Are the HTML documents big? Parsing can be time consuming, so `multiprocessing` can help. — Andrej Kesely, Sep 11 '21 at 19:14
Yes, but I don't know how to implement multiprocessing in my code (note: I'm new to coding in Python). — JONH, Sep 12 '21 at 09:02
I've added a simple example of how to use `multiprocessing.Pool` with `beautifulsoup`. — Andrej Kesely, Sep 12 '21 at 09:18

Andrej Kesely · Accepted Answer · 2021-09-12T10:38:22.583


Here is an example of how to use multiprocessing with requests/BeautifulSoup:

import requests
from tqdm import tqdm  # for pretty progress bar
from bs4 import BeautifulSoup
from multiprocessing import Pool

# some 1000 links to analyze
links1 = [
    "https://en.wikipedia.org/wiki/2021_Moroccan_general_election",
    "https://en.wikipedia.org/wiki/Tangerang_prison_fire",
    "https://en.wikipedia.org/wiki/COVID-19_pandemic",
    "https://en.wikipedia.org/wiki/Yolanda_Fern%C3%A1ndez_de_Cofi%C3%B1o",
] * 250


def parse(url):
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    return soup.select_one("h1").get_text(strip=True)


if __name__ == "__main__":
    with Pool() as p:
        out = []
        for r in tqdm(p.imap(parse, links1), total=len(links1)):
            out.append(r)

    print(len(out))

With my internet connection/CPU (Ryzen 3700x) I was able to get results from all 1000 links in 30 seconds:

100%|██████████| 1000/1000 [00:30<00:00, 33.12it/s]
1000

All my CPU cores were utilized (as shown by htop).
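If you would rather keep grequests for the downloads, one possible variation is to hand only the already-downloaded HTML to the pool, so the worker processes spend their time on parsing alone. This is a minimal sketch, not a drop-in replacement: `extract_txt` and `parse_all` are hypothetical names, the sample documents are made up, and the `div.txt` lookup is taken from the question. In the question's code the strings would be `r.text` for each response.

```python
from multiprocessing import Pool

from bs4 import BeautifulSoup


def extract_txt(html):
    """Return the text of the first <div class="txt">, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", class_="txt")
    return div.get_text(strip=True) if div is not None else None


def parse_all(html_docs, processes=None, chunksize=8):
    """Parse HTML strings in worker processes and return results in input order."""
    with Pool(processes) as pool:
        # chunksize batches several documents per task to cut inter-process overhead
        return list(pool.imap(extract_txt, html_docs, chunksize=chunksize))


if __name__ == "__main__":
    # Stand-in documents (assumption); replace with the text of real responses.
    sample = ['<div class="txt">page %d</div>' % i for i in range(100)]
    results = parse_all(sample)
    print(len(results), results[:3])
```

Whether this beats the version above depends on document size: for small pages, the cost of sending the HTML to the workers can outweigh the parsing time saved.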

edited Sep 12 '21 at 10:38

answered Sep 12 '21 at 09:17


I executed the same code, but it throws a bunch of errors, and they keep coming. – JONH Sep 12 '21 at 09:32
@JONH Which kind of errors? Do you use the exact code with the same links as in my code? – Andrej Kesely Sep 12 '21 at 09:38
Yes: "An attempt has been made to start a new process before the current process has finished its bootstrapping phase. This probably means that you are not using fork to start your child processes and you have forgotten to use the proper idiom in the main module:" – JONH Sep 12 '21 at 10:34
@JONH Try this: https://stackoverflow.com/questions/55057957/an-attempt-has-been-made-to-start-a-new-process-before-the-current-process-has-f I've updated my answer. – Andrej Kesely Sep 12 '21 at 10:37
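For context on that error: on platforms where child processes are started with "spawn" rather than "fork" (the default on Windows, for example), each child imports the main module again. If the pool is created at module top level, every child tries to create its own pool during that import, which triggers the bootstrapping error. A minimal sketch of the idiom the message refers to, with `square` and `main` as placeholder names:

```python
from multiprocessing import Pool


def square(n):
    # Worker functions live at module top level so child processes can import them.
    return n * n


def main():
    with Pool(processes=2) as pool:
        print(pool.map(square, range(5)))  # prints [0, 1, 4, 9, 16]


if __name__ == "__main__":
    # Only the parent process runs this branch; a spawned child that
    # re-imports the module skips it and does not start another pool.
    main()
```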
