This is my first question here. I recently went through a course on web scraping and wanted to do something on my own, but now I am stuck. Here is the question:
I have 120k URLs in a file. The URLs look something like this: www.example.com/.../3542/../may/.. So there are 10,000 combinations (0000-9999) multiplied by the 12 months, which makes 120,000 links.
I saw that some of them return HTTP error 500, some of them redirect to a designated page, and the rest should be the ones I need, but I am struggling to filter out the ones I don't need.
I tried using urllib.request.urlopen(url) in a try/except block to filter out the HTTP 500 responses. I also used BeautifulSoup to retrieve the title of the page and check whether it matches the page I'm being redirected to. However, this seems really slow.
I also tried filtering by status code with 'requests', but that is not fast either.
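Roughly, the requests attempt looked like this (just a sketch; the file names and the exact checks are placeholders, not my real code):

import requests

with open("urls.txt") as urls_file, open("filtered.txt", "w") as out_file:
    for line in urls_file:
        url = line.strip()
        try:
            # don't follow redirects, so the redirected pages show up as 301/302
            r = requests.get(url, allow_redirects=False, timeout=10)
            if r.status_code == 200:
                out_file.write(url + "\n")
        except requests.RequestException:
            pass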
And here is the urllib part of the code that I was talking about above:
import urllib.request
from bs4 import BeautifulSoup

# fname is a file handle with one URL per line; filtered_links is the output file
for line in fname:
    url = line.strip()
    try:
        # urlopen raises HTTPError for the 500 responses, which the except swallows
        f = urllib.request.urlopen(url)
        soup = BeautifulSoup(f.read().decode(), 'html.parser')
        title = soup.title.string
        # skip pages that redirected to the designated page
        if title != "Redirected Title":
            filtered_links.write(line)
    except Exception:
        pass
I'm wondering whether there is a way to fetch only the title, whether that would be faster, and how to achieve it.
Thank you for your time, and feel free to share some knowledge, either about a fix or a different approach.