I'm trying to create a webpage scraper and I'm getting a MissingSchema: Invalid URL error.
Here is the code I have:
import requests
from bs4 import BeautifulSoup as bs
class HrefScraper(object):
    def __init__(self, url):
        self.url = url

    def requestUrl(self):
        getpage = requests.get(self.url)
        return getpage.content

    def hrefParser(self):
        _list = []
        soup = bs(self.requestUrl())
        anchors = soup.findAll('a')
        for items in anchors:
            href = items.get('href', None)
            if 'http' in href:
                if url[11:] in href and href not in _list:
                    _list.append(href)
                else:
                    _list.append(href)
            if '//' in href and 'http' not in href:
                _list.append(self.url + href)
        return _list

if __name__ == '__main__':
    url = 'https://www.google.com'
    scraper = HrefScraper(url)
    scraper.requestUrl()
    scraper.hrefParser()
    for i in scraper.hrefParser():
        Loop = HrefScraper(i)
        Loop.requestUrl()
        try:
            for i in Loop.hrefParser():
                print i
        except TypeError:
            pass
The script takes all the URLs from a webpage and then loops through them recursively; at least, that is the intended effect. When the script gets to an href that only holds a path without the site's address, it fails hard. I tried to write handling for that case with no success. Could someone help me understand a better way to do this?
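For reference, requests raises MissingSchema whenever requests.get is handed a URL with no scheme, which is exactly what a bare path from an href looks like. A minimal reproduction of the error I'm seeing (the "/policy" value is just the example from below):

    import requests

    try:
        requests.get('/policy')  # a scheme-less path, like a relative href
    except requests.exceptions.MissingSchema as e:
        print e  # Invalid URL '/policy': No schema supplied...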
Is there a way I could:
url = request.headers['host']; request.get(url + //randomSomethingWithNoAddress)
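Something like this is what I have in mind; a minimal sketch using urlparse from the standard library to pull the scheme and host back out of self.url (the base URL's /press/index.html path and the /randomSomethingWithNoAddress path are made-up stand-ins):

    from urlparse import urlparse  # urllib.parse on Python 3

    base = 'https://www.google.com/press/index.html'
    parts = urlparse(base)
    root = parts.scheme + '://' + parts.netloc      # https://www.google.com
    print root + '/randomSomethingWithNoAddress'    # https://www.google.com/randomSomethingWithNoAddress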
Thanks for any and all advice :)
Edit: The duplicate link posted above my thread isn't the answer to my question. I understand fully what that OP is saying. I wrote a handler for that with self.url + href, so there should be no need for urljoin.
My question is: I start off getting URLs like "http://www.example.com" and crash on URLs like "/policy". Why "/policy" and not "/press/"? If you ran the script you would see what I am talking about.
Is my understanding of the script wrong? Why am I getting so many absolute paths that don't fail?
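Could the substring test be part of it? 'http' in href in hrefParser is a plain substring check, so a relative href that merely contains the text "http" somewhere, e.g. in a query string, would pass it and get appended unchanged. A hypothetical example (the href value here is made up):

    href = '/url?q=http%3A%2F%2Fexample.com'  # relative path that mentions "http"
    print 'http' in href                       # True, so it gets appended as-is
    # requests.get(href) on the next pass would then raise MissingSchema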