
This is my first question here. I recently went through a course on web scraping and wanted to do something on my own, but now I'm stuck. So here is the question:

I have 120k URLs in a file. The URLs look something like this: www.example.com/.../3542/../may/.. So there are 10,000 combinations (0000-9999) multiplied by the 12 months, which makes 120,000 links.

I noticed that some of them return HTTP error 500, some redirect to a designated page, and the rest should be the ones I need, but I'm struggling to filter out the ones I don't need.

I tried using urllib.request.urlopen(url) in a try/except block to filter out the HTTP 500 responses. I also used BeautifulSoup to retrieve the title of the webpage and check whether it matches the page I'm being redirected to. However, this seems really slow.

I also tried filtering by status code with requests, but this is not fast either.
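
For reference, that attempt looked roughly like this (the file names are just placeholders):

import requests

with open("urls.txt") as urls_file, open("filtered.txt", "w") as filtered_links:
    for line in urls_file:
        url = line.strip()
        try:
            response = requests.get(url, timeout=10)
            # keep only the links that answer with 200 OK
            if response.status_code == 200:
                filtered_links.write(url + "\n")
        except requests.RequestException:
            pass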

And this is the urllib/BeautifulSoup code I was talking about above:

import urllib.request
from bs4 import BeautifulSoup

# fname is a file handle over the url list; filtered_links is the output file
for line in fname:
    url = line.strip()          # drop the trailing newline before opening the url
    try:
        f = urllib.request.urlopen(url)
        soup = BeautifulSoup(f.read().decode(), 'html.parser')
        title = soup.title.string
        # keep the link unless it was redirected to the known page
        if title != "Redirected Title":
            filtered_links.write(line)
    except:
        # urlopen raises HTTPError for the 500 responses, so they are skipped here
        pass

I'm wondering whether fetching only the title somehow would be faster, and how to achieve it.
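
What I had in mind is reading only the beginning of each response instead of the whole page, though I'm not sure this is the right way to do it (the 2048-byte chunk size is just a guess):

import re
import requests

def quick_title(url):
    # stream the response and read only the first chunk instead of the whole body
    with requests.get(url, stream=True, timeout=10) as response:
        first_chunk = next(response.iter_content(chunk_size=2048), b"").decode(errors="ignore")
    match = re.search(r"<title[^>]*>(.*?)</title>", first_chunk, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None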

Thank you for your time, and feel free to share some knowledge, either about a fix or a different approach.

nova
    Similar https://stackoverflow.com/questions/63841475/how-to-improve-the-webscrapping-code-speed-by-multi-threading-code-python#comment112894658_63841475 – snakecharmerb Sep 12 '20 at 21:24
  • Perhaps you could take an asynchronous approach with asyncio and [aiohttp](https://docs.aiohttp.org/en/stable/). – Detlef Sep 12 '20 at 21:27
  • I also strongly recommend aiohttp, it's much faster (a rough sketch of that approach follows after these comments). – geckos Sep 12 '20 at 22:01
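
A minimal sketch of that asyncio + aiohttp approach (the limit of 40 concurrent requests and the 10-second timeout are assumptions; redirects are not followed, so the redirecting links also fail the 200 check):

import asyncio
import aiohttp

async def check(session, url, semaphore):
    # return the url if it answers 200 directly, otherwise None
    async with semaphore:
        try:
            async with session.get(url, allow_redirects=False) as resp:
                return url if resp.status == 200 else None
        except (aiohttp.ClientError, asyncio.TimeoutError):
            return None

async def main(urls):
    semaphore = asyncio.Semaphore(40)  # at most 40 requests in flight at once
    timeout = aiohttp.ClientTimeout(total=10)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        results = await asyncio.gather(*(check(session, url, semaphore) for url in urls))
    return [url for url in results if url is not None]

# urls = [...]  # the 120k links
# good_links = asyncio.run(main(urls))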

1 Answer


I recently did a brute-force challenge that involved a vast number of requests. I used the parallelism approach from here (the requests-futures package) and could run 40 requests at once (each batch took about 2 seconds). You can change the number of concurrent requests as you wish, depending on your connection speed.

from concurrent.futures import ThreadPoolExecutor
from requests_futures import sessions

urls = []   # add your list of urls here

# change max_workers as you wish
session = sessions.FuturesSession(executor=ThreadPoolExecutor(max_workers=40))

futures = [session.get(url) for url in urls]

# resolve every future once, then keep the urls of the requests that returned 200
responses = [f.result() for f in futures]
results = [r.url for r in responses if r.status_code == 200]

print(f"Results: {results}")
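
If you also need to drop the links that redirect to that designated page, the same response objects already carry that information: requests stores any followed redirects in response.history, so you can filter on it as well. A small addition building on the code above:

# a response that was redirected at least once has a non-empty .history
results = [r.url for r in responses if r.status_code == 200 and not r.history]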
Kurosh Ghanizadeh