
I have a script that parses a list of thousands of URLs, but my problem is that it takes ages to get through the whole list.

Each URL request takes around 4 seconds before the page is loaded and can be parsed.
Is there any way to parse a really large number of URLs quickly?

My code looks like this:

from bs4 import BeautifulSoup
import requests

# read the URL list
with open('urls.txt') as f:
    content = f.readlines()
# remove trailing newline characters
content = [line.strip('\n') for line in content]

# loop through the URL list and get information
for i in range(5):
    try:
        for url in content:
            # get information
            link = requests.get(url)
            data = link.text
            soup = BeautifulSoup(data, "html5lib")

            # just example scraping
            name = soup.find_all('h1', {'class': 'name'})
    except requests.RequestException:
        pass  # restart the list on a failed request (up to 5 attempts)
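
The sequential loop above spends almost all of its time waiting on network I/O, so running several requests concurrently is the usual remedy. A minimal sketch using `concurrent.futures.ThreadPoolExecutor` from the standard library; the `fetch` helper here is a hypothetical stand-in for `requests.get(url).text` so the sketch runs without network access:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # stand-in for requests.get(url).text so the sketch
    # runs without network access (hypothetical helper)
    return "<h1 class='name'>%s</h1>" % url

urls = ["http://example.com/%d" % i for i in range(10)]

# run up to 5 fetches at a time instead of one after another
with ThreadPoolExecutor(max_workers=5) as pool:
    pages = list(pool.map(fetch, urls))

print(len(pages))  # 10 results, in input order
```

With real `requests.get` calls the threads overlap the 4-second waits, so 5 workers cut the total time roughly fivefold.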

EDIT: how do I handle asynchronous requests with hooks in this example? I tried the following, as mentioned on this page: Asynchronous Requests with Python requests:

from bs4 import BeautifulSoup
import grequests

def parser(response, *args, **kwargs):
    # the hook is called with the finished response,
    # so there is no need to request the URL again
    data = response.text
    soup = BeautifulSoup(data, "html5lib")

    # just example scraping
    name = soup.find_all('h1', {'class': 'name'})

# read urls.txt and store it in a list variable
with open('urls.txt') as f:
    urls = f.readlines()
# remove trailing newline characters
urls = [line.strip('\n') for line in urls]

# a list to hold the requests to run asynchronously
async_list = []

for u in urls:
    # the "hooks = {..." part is where you define what you want to do
    #
    # note the lack of parentheses following parser: the response
    # will be passed to it as the first argument automatically
    rs = grequests.get(u, hooks={'response': parser})

    # add the request to our list of things to do asynchronously
    async_list.append(rs)

# run the whole list, at most 5 requests at a time
grequests.map(async_list, size=5)

This doesn't work for me. I don't even get an error in the console; it just runs for a long time and then stops.
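
For Python 3, the comments point toward asyncio (with aiohttp for the actual HTTP calls). A minimal sketch of that concurrency pattern, with a simulated `fetch` coroutine standing in for the real aiohttp request so it runs offline:

```python
import asyncio

async def fetch(url):
    # stand-in for an aiohttp request; the sleep mimics network latency
    await asyncio.sleep(0.01)
    return "<h1 class='name'>%s</h1>" % url

async def main(urls):
    # schedule all fetches at once; they wait for I/O concurrently
    return await asyncio.gather(*(fetch(u) for u in urls))

urls = ["http://example.com/%d" % i for i in range(5)]
pages = asyncio.run(main(urls))
print(len(pages))  # 5 results, in input order
```

`asyncio.gather` returns the results in the order the coroutines were passed in, so each page can still be matched back to its URL.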

    The documentation is your friend: http://docs.python-requests.org/en/v0.10.6/user/advanced/#asynchronous-requests – Tomalak Sep 08 '17 at 13:19
  • I would suggest to break your URL list and put time gaps between requests, exactly what @Tomalak suggests – chad Sep 08 '17 at 13:20
  • @Tomalak you should make that an answer cause it solves the user's issue at a first problem. – Horia Coman Sep 08 '17 at 13:22
  • For faster parsing, use `"lxml"` instead of `"html5lib"`. [See here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser) – MD. Khairul Basar Sep 08 '17 at 13:23
  • Thank you very much, could you explain how exactly this works? Right now I iterate through my URL list and make a request for each URL; when the request is made, I parse the page and store the result in a database, then the loop makes the next request. Can that be combined with this method? – kratze Sep 08 '17 at 13:28
  • 1
    The documentation describes how to use event hooks. Just put your content-handling code into a function and hock that to the `response` event. This works exactly the same way for `requests.get()` and `async.get()`. – Tomalak Sep 08 '17 at 13:36
  • 1
    I edited the code, is that the way how to use it? Sorry i find the documentation for beginners a bit poor. But i also noticed, that this isn't available in Python3. I wrote my whole script with Python3 – kratze Sep 08 '17 at 14:51
  • 1
    Not quite, see here for working code: https://stackoverflow.com/questions/9110593/asynchronous-requests-with-python-requests. As per the comments to the second answer in the same thread, you can use [async io for HTTP](https://pypi.python.org/pypi/aiohttp) on Python3. This blog post seems to address that in detail: https://pawelmhm.github.io/asyncio/python/aiohttp/2016/04/22/asyncio-aiohttp.html – Tomalak Sep 08 '17 at 14:56
  • This looks promising and understandable, thank you ! – kratze Sep 08 '17 at 14:59
  • 1
    Sorry for not writing that as an answer, but it would take me more time than I have today to write a useful one. I'm confident you can transform the code in the blog post to a solution - once you've figured it out, please post your own answer. I'll stick around and upvote. – Tomalak Sep 08 '17 at 15:02

0 Answers