I'm working on a web scraper that opens a webpage and prints any links on that page that contain a keyword (I will later open these links for further scraping).
For example, I am using the requests module to open "cnn.com", and then trying to parse out all href/links within that webpage. Then, if any of the links contain a specific word (such as "china"), Python should print that link.
I could simply open the main page using requests, save all the hrefs in a list ('links'), and then use:
    links = [...]
    keyword = "china"
    for link in links:
        if keyword in link:
            print(link)
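For context, here is a minimal sketch of how a list like 'links' might be collected and filtered. It uses only the standard library's html.parser on a hard-coded sample string instead of a live request, so the class name `HrefCollector`, the helper `collect_links`, and the sample markup are all hypothetical stand-ins rather than the actual script:

```python
from html.parser import HTMLParser

# Hypothetical sample markup standing in for a downloaded page.
SAMPLE_HTML = """
<a href="https://www.cnbc.com/2019/08/11/how-recession-affects-tech-industry.html">Tech</a>
<a href="/2019/08/10/europe/luxembourg-france-amsterdam-tornado-intl/index.html">Tornado</a>
<a name="no-href">Anchor without a link</a>
"""


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs for the tag.
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(value)


def collect_links(html):
    """Return every non-empty href found in the given HTML string."""
    parser = HrefCollector()
    parser.feed(html)
    return parser.links


links = collect_links(SAMPLE_HTML)
keyword = "europe"  # assumed keyword, chosen so the sample has a match
matches = [link for link in links if keyword in link]
for link in matches:
    print(link)
```

In a real run the HTML string would come from the response body of the request rather than from a constant.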
However, the problem with this method is that the links I parse out aren't always full links. For example, all links on CNBC's page are structured like this:
href="https://www.cnbc.com/2019/08/11/how-recession-affects-tech-industry.html"
But on CNN's page, they're written like this (not full links; they're missing the scheme and domain that come before the first "/"):
href="/2019/08/10/europe/luxembourg-france-amsterdam-tornado-intl/index.html"
This is a problem because I'm writing more code to automatically open these links and parse them. But Python can't open
"/2019/08/10/europe/luxembourg-france-amsterdam-tornado-intl/index.html"
because it isn't a full link.
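To make the difference concrete, here is a small sketch that uses the standard library's urllib.parse.urlparse to show what each kind of href is missing. The helper name `is_absolute` is hypothetical, and treating "has both a scheme and a host" as the definition of a full link is an assumption made for illustration:

```python
from urllib.parse import urlparse


def is_absolute(href):
    """Return True if the href carries both a scheme and a host (assumed definition)."""
    parts = urlparse(href)
    return bool(parts.scheme and parts.netloc)


# The two example hrefs from this post.
cnbc_href = "https://www.cnbc.com/2019/08/11/how-recession-affects-tech-industry.html"
cnn_href = "/2019/08/10/europe/luxembourg-france-amsterdam-tornado-intl/index.html"

for href in (cnbc_href, cnn_href):
    parts = urlparse(href)
    # For the CNN-style href, scheme and netloc are empty strings; only the path is set.
    print(repr(parts.scheme), repr(parts.netloc), is_absolute(href))
```

The CNN-style href parses to a path only, which is why nothing can be fetched from it on its own.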
So, what is a robust solution to this (something that works for other sites too, not just CNN)?
EDIT: I know the links I wrote as examples in this post don't contain the word "China", but these are just examples.