
I am scraping blog URLs from the main page, and later I iterate over all of the URLs to retrieve the text on each one. Would a generator be faster if I moved the loop into blogscraper and made it yield some_text? I guess the app would still be single-threaded, and it won't request the next page while it is extracting text from the HTML.

Should I use asyncio, or are there better modules to make it parallel? Ideally I would like a generator that yields coroutine results as the coroutines finish.

I also want to build a small REST app later for displaying the results.

def readmainpage(self):
    blogurls = []
    while nextPage:
        r = requests.get(url)
        ...
        blogurls += [new_url]
    return blogurls


def blogscraper(self, url):
    r = requests.get(url)
    ...
    return sometext

def run(self):
    blog_list = self.readmainpage()
    for blog in blog_list:
        data = self.blogscraper(blog['url'])
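To make the question concrete, the generator version I have in mind would look roughly like this (same placeholders as above; as far as I understand it is still one thread, so each request blocks before the next one starts):

def blogscraper_gen(self, urls):
    # hypothetical generator variant of blogscraper: yields each text as it is scraped
    for url in urls:
        r = requests.get(url)
        ...
        yield some_text

def run(self):
    for data in self.blogscraper_gen(self.readmainpage()):
        print(data)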

Grzegorz Krug

2 Answers


Using the threading package, you can run your scraping calls concurrently: each request runs in its own worker thread, and because the threads spend most of their time waiting on the network, those waits overlap instead of happening one after another. For example, if fetching a single page takes 2 minutes and you have 10 pages, fetching them in 10 threads takes roughly 2 minutes in total rather than 20. See Threading in Python 3.x.
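A rough sketch with concurrent.futures.ThreadPoolExecutor, which manages the worker threads for you (readmainpage and blogscraper are the methods from the question; the worker count of 10 is only an example):

from concurrent.futures import ThreadPoolExecutor

def run(self):
    blog_list = self.readmainpage()
    urls = [blog['url'] for blog in blog_list]
    # each worker thread fetches one blog at a time, so up to 10 requests overlap
    with ThreadPoolExecutor(max_workers=10) as pool:
        return list(pool.map(self.blogscraper, urls))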

furkanayd

With asyncio you can use the aiohttp module: pip install aiohttp. As an example, the code could look something like this; some improvements are still possible, but that depends on the rest of your code...

import sys
import aiohttp
import asyncio
import socket
from urllib.parse import urlparse

class YourClass:
    def __init__(self):
        self.url = "..."
        url_parsed = urlparse( self.url )
        self.session = aiohttp.ClientSession(
            headers = { "Referer": f"{ url_parsed.scheme }://{ url_parsed.netloc }" },
            auto_decompress = True,
            connector = aiohttp.TCPConnector(family=socket.AF_INET, verify_ssl=False) )


    async def fetch(self, url):
        async with self.session.get( url ) as resp:
            assert resp.status == 200
            return await resp.text()

    async def readmainpage(self):
        blogurls = []
        while nextPage:
            r = await self.fetch(self.url)
            # ... parse r, collect new_url, decide whether there is a next page
            blogurls += [new_url]
        return blogurls


    async def blogscraper(self, url):
        r = await self.fetch(url)
        # ... extract sometext from the page here
        return sometext

    async def __call__(self):
        blog_list = await self.readmainpage()
        # schedule every blog page download concurrently
        coros = [ asyncio.ensure_future( self.blogscraper( blog['url']) ) for blog in blog_list ]

        for data in await asyncio.gather( *coros ):
            print(data)

        # do not forget to close session if not using with statement
        await self.session.close()
        

def main():
    fetcher = YourClass()
    loop = asyncio.get_event_loop()
    loop.run_until_complete( fetcher() )
    sys.exit(0)

if __name__ == "__main__":
    main()
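If you specifically want to consume results as the coroutines finish (the generator-like behaviour asked about in the question), asyncio.as_completed can be used in place of gather; roughly, __call__ would become:

    async def __call__(self):
        blog_list = await self.readmainpage()
        coros = [ self.blogscraper(blog['url']) for blog in blog_list ]

        # as_completed yields awaitables in the order they finish, not the order they were submitted
        for fut in asyncio.as_completed(coros):
            data = await fut
            print(data)

        await self.session.close()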
IVI