
I am trying to scrape all valid web links from a web page using Beautiful Soup in Python. I took a sample snippet of code from here: retrieve links from web page using python and BeautifulSoup, and it works fine. The problem is that I cannot filter the valid links out of the response.
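For reference, this is roughly the snippet I am using (a minimal sketch, assuming `requests` and `beautifulsoup4` are installed; the target URL is just the example from below):

```python
import requests
from bs4 import BeautifulSoup

# Fetch the page and collect the href of every <a> tag on it.
response = requests.get('https://stackoverflow.com/questions/')
soup = BeautifulSoup(response.text, 'html.parser')
links = [a['href'] for a in soup.find_all('a', href=True)]
print(links)
```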

Example: When I try to scrape the link https://stackoverflow.com/questions/, it returns a lot of links in the response.

Response:

['/users', '/tags/', '/ask'....]

But on Stack Overflow, https://stackoverflow.com/questions/**users** and https://stackoverflow.com/questions/**tags** are not valid URLs; they return 404 when I hit them from the browser. However, https://stackoverflow.com/questions/**ask** is a valid URL.

So, how can I filter the valid links from among all the links, or, if possible, how can I scrape only the valid links? I am not trying to scrape Stack Overflow; it is just an example for this post.
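For what it's worth, I also experimented with `urllib.parse.urljoin` to turn the paths into absolute URLs (a rough sketch, using the same example page as above):

```python
from urllib.parse import urljoin

page_url = 'https://stackoverflow.com/questions/'

# A root-relative path (leading slash) resolves against the domain,
# not against the page path, so '/users' becomes
# 'https://stackoverflow.com/users' rather than '.../questions/users'.
print(urljoin(page_url, '/users'))  # https://stackoverflow.com/users
```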

I also tried hitting the URL with a GET request to check its status code, using the code below.

```python
import requests

status_code = requests.get('https://stackoverflow.com/questions/questions/tagged/flutter').status_code
```

The above code returns 404, which is good, but a few websites automatically redirect to a valid URL by trimming everything after the base domain, so they always return a 200 status code.
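One idea I had was to detect those redirects explicitly instead of following them, e.g. by disabling automatic redirects in `requests` (a sketch with a placeholder URL; I am not sure this generalizes to every site):

```python
import requests

url = 'https://example.com/some/possibly-invalid/path'  # placeholder URL
response = requests.get(url, allow_redirects=False)

# A 3xx status here means the site wanted to redirect instead of
# serving the page directly; the Location header (when present)
# shows where it would have gone.
print(response.status_code, response.headers.get('Location'))
```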

Thanks in advance for the help.

Prakash
  • So, what's your question? – baduker Aug 24 '23 at 08:38
  • By the way, these `['/users', '/tags/', '/ask'....]` are your paths, but the authority is `stackoverflow.com`, so valid links should be `https://stackoverflow.com/tags` and `https://stackoverflow.com/ask`, for example. Not, as you have, `https://stackoverflow.com/questions/tags` etc. Read about [relative vs. absolute links here](https://stackoverflow.com/questions/2005079/absolute-vs-relative-urls). – baduker Aug 24 '23 at 08:46

0 Answers