
I'm working on a webscraper that opens a webpage, and prints any links within that webpage if the link contains a keyword (I will later open these links for further scraping).

For example, I am using the requests module to open "cnn.com", and then trying to parse out all href/links within that webpage. Then, if any of the links contain a specific word (such as "china"), Python should print that link.

I could just simply open the main page using requests, save all href's onto a list ('links'), and then use:

links = [...]
keyword = "china"

for link in links:
    if keyword in link:
        print(link)
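For completeness, here is a minimal, stdlib-only sketch of the "save all hrefs onto a list" step using `html.parser` (the class name and sample HTML are illustrative; a scraper using BeautifulSoup would do the same thing with `soup.find_all("a")`):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag encountered."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Illustrative HTML fragment standing in for a fetched page
html = '<a href="/2019/08/10/asia/china-trade/index.html">story</a>'
parser = LinkCollector()
parser.feed(html)
print(parser.links)
# → ['/2019/08/10/asia/china-trade/index.html']
```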

However, the problem with this method is that the links I originally parsed out aren't full links. For example, all links within CNBC's webpage are structured like this:

href="https://www.cnbc.com/2019/08/11/how-recession-affects-tech-industry.html"

But for CNN's page, they're written like this (not full links... they're missing the part that comes before the "/"):

href="/2019/08/10/europe/luxembourg-france-amsterdam-tornado-intl/index.html"

This is a problem because I'm writing more script to automatically open these links to parse them. But Python can't open

"/2019/08/10/europe/luxembourg-france-amsterdam-tornado-intl/index.html"

because it isn't a full link.

So, what is a robust solution to this (something that works for other sites too, not just CNN)?

EDIT: I know the links I wrote as an example in this post don't contain the word "China", but these are just examples.

F16Falcon
    Prepend the domain? `"https://www.cnn.com" + href`. If the domain is dynamic, use a variable. – TrebledJ Aug 11 '19 at 15:38
  • @TrebledJ I was thinking of doing this, but I have like 50 different "news sites", so I can't simply prepend "https://www.cnn.com" onto all. I was wondering if there was a way for Python to automatically prepend the correct link into the hrefs? – F16Falcon Aug 11 '19 at 15:40
  • Without a minimal example of the scraper you're using, it's hard to say... I assume to scrape a site you'll call a function? Or instantiate a new class instance? Pass the domain as another parameter, and if the link is partial then prepend the domain. – TrebledJ Aug 11 '19 at 15:47
  • IIRC scrapy doesn't have this issue. They have a function which automatically follows the link. Might be worth learning. – TrebledJ Aug 11 '19 at 15:49
  • @TrebledJ Oh, I'm just using BeautifulSoup. I thought about your answer (prepending), and determined that I could do this: I could use text.partition to split ALL parsed (children) links and remove the first half (so the cnn.com part). Then, I could also partition the parent URLs to get all of the cnn.com parts, and prepend them onto my children URLs. Since I have a different request function for each parent URL, this should work. I appreciate the help man, if you'd like, you can answer this question using this method and I can vote up + mark as best answer. :) – F16Falcon Aug 11 '19 at 15:55
  • 1
    Hiw to make absolute links easily: https://stackoverflow.com/questions/44001007/scrape-the-absolute-url-instead-of-a-relative-path-in-python – rici Aug 12 '19 at 05:14

1 Answer


Try using the urljoin function from the urllib.parse module. It takes two parameters: the first is the URL of the page you're currently parsing, which serves as the base for relative links, and the second is the link you found. If the link you found is already absolute (starts with http:// or https://), urljoin returns it unchanged; otherwise it resolves the link relative to the base URL.

So for example:

#!/usr/bin/env python3

from urllib.parse import urljoin

print(
  urljoin(
    "https://www.cnbc.com/",
    "/2019/08/10/europe/luxembourg-france-amsterdam-tornado-intl/index.html"
  )
)
# prints "https://www.cnbc.com/2019/08/10/europe/luxembourg-france-amsterdam-tornado-intl/index.html"

print(
  urljoin(
    "https://www.cnbc.com/",
    "http://some-other.website/"
  )
)
# prints "http://some-other.website/"
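Putting this together with the keyword filter from the question, a minimal sketch (the base URL, hrefs, and keyword below are illustrative stand-ins for whatever the scraper actually collects):

```python
from urllib.parse import urljoin

# Hypothetical hrefs as scraped from a page: a mix of relative and absolute
base_url = "https://www.cnn.com/"
hrefs = [
    "/2019/08/10/asia/china-trade-intl/index.html",
    "https://www.cnbc.com/2019/08/11/china-tariffs.html",
    "/2019/08/10/europe/luxembourg-france-amsterdam-tornado-intl/index.html",
]
keyword = "china"

# Resolve each href against the page it came from, then filter by keyword
matches = [
    urljoin(base_url, href)
    for href in hrefs
    if keyword in urljoin(base_url, href).lower()
]
for link in matches:
    print(link)
```

Because urljoin leaves absolute URLs alone, the same loop works whether a site (like CNBC) emits full links or (like CNN) emits root-relative paths, so nothing site-specific needs to be hard-coded for the 50 news sites.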
saintamh