I'm trying to build a web crawler using the requests module. Basically, I want it to go to a webpage, get all the hrefs, and then write them to a text file.
So far my code looks like this:
    import requests
    from bs4 import BeautifulSoup

    def getLinks(url):
        response = requests.get(url).text
        soup = BeautifulSoup(response, "html.parser")
        for link in soup.findAll("a"):
            print("Link: " + str(link.get("href")))
which works on some sites, but on the one I'm trying to use it on, the hrefs aren't full URLs like "www.google.com". Instead they're more like... paths to a directory that redirects to the actual link?
They look like this:
href="/out/101"
and if I try to write that to a file, it looks like this:
1. /out/101
2. /out/102
3. /out/103
4. /out/104
which isn't really what I want.
So how do I go about getting the actual domain names from these links?
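From some searching, I think joining the relative path onto the page's URL with urllib.parse.urljoin and then letting requests follow the redirect might be the way, but I'm not sure. Here's a sketch of what I've been experimenting with; this assumes the /out/... paths answer with HTTP redirects, and the example.com URLs plus the function names are just placeholders I made up:

```python
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def extractLinks(html):
    """Collect every href on the page (relative or absolute)."""
    soup = BeautifulSoup(html, "html.parser")
    return [a.get("href") for a in soup.findAll("a") if a.get("href")]


def resolveDomain(page_url, href):
    """Turn a relative href like /out/101 into an absolute URL,
    follow any redirects, and return the final domain."""
    absolute = urljoin(page_url, href)  # /out/101 -> https://example.com/out/101
    final = requests.get(absolute, allow_redirects=True).url  # URL after redirects
    return urlparse(final).netloc  # just the domain part, e.g. www.google.com


# hypothetical usage (example.com stands in for the real site):
# for href in extractLinks(requests.get("https://example.com").text):
#     print(resolveDomain("https://example.com", href))
```

Is that roughly the right idea, or is there a better way to get the redirect target?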