
I'm trying to create a webpage scraper and I'm getting a `MissingSchema: Invalid URL` error.

Here is the code I have:

import requests
from bs4 import BeautifulSoup as bs

class HrefScraper(object):
    def __init__(self, url):
        self.url = url

    def requestUrl(self):
        getpage = requests.get(self.url)
        return getpage.content

    def hrefParser(self):
        _list = []
        soup = bs(self.requestUrl())
        anchors = soup.findAll('a')
        for items in anchors:
            href = items.get('href', None)
            if 'http' in href:
                if url[11:] in href and href not in _list:
                    _list.append(href)
            else:
                _list.append(href)
                if '//' in href and 'http' not in href:
                    _list.append(self.url + href)
        return _list


if __name__ == '__main__':
    url = 'https://www.google.com'
    scraper = HrefScraper(url)
    scraper.requestUrl()
    scraper.hrefParser()
    for i in scraper.hrefParser():
        Loop = HrefScraper(i)
        Loop.requestUrl()
        try:
            for i in Loop.hrefParser():
                print i
        except TypeError:
            pass

The script takes all the URLs from a webpage and then loops through them recursively. At least, that is the intended effect. When the script gets to an href that only holds a directory path without the site's address, it fails hard. I tried to create handling for that with no success. Could someone help me understand a better way to do this?

Is there a way I could do something like this:

url = request.headers['host']; request.get(url + //randomSomethingWithNoAddress)
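
The standard-library answer to that pseudocode is `urljoin`, which resolves a relative href against the page's base URL. A minimal sketch (using Python 2's `urlparse` module to match the `print i` above; the example hrefs are made up):

from urlparse import urljoin  # Python 3: from urllib.parse import urljoin

base = 'https://www.google.com'

# Relative paths are resolved against the base URL...
print urljoin(base, '/policy')             # https://www.google.com/policy
# ...scheme-relative URLs inherit the base's scheme...
print urljoin(base, '//example.com/a')     # https://example.com/a
# ...and absolute URLs pass through unchanged.
print urljoin(base, 'http://example.com')  # http://example.com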

Thanks for any and all advice :)

Edit: The duplicate link posted above my thread isn't the answer to my question. I understand fully what that OP is saying. I wrote a handler for that with `self.url + href`, so there is no need for `urljoin`.

My question is: I start off getting URLs like "http://www.example.com" and crash on URLs like "/policy". Why "/policy" and not "/press/"? If you ran the script you could see what I am talking about.

Is my understanding of the script wrong? Why am I getting so many absolute paths that don't fail?
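
One way to test this for yourself is to print every raw href before any request is made; relative paths and `None` values (anchors with no href attribute) show up immediately. A minimal diagnostic sketch, reusing the same libraries as the code above:

import requests
from bs4 import BeautifulSoup as bs

soup = bs(requests.get('https://www.google.com').content, 'html.parser')
for a in soup.findAll('a'):
    # repr() makes None and relative paths like '/policy' obvious
    print repr(a.get('href'))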

  • And yet when I replace... whatever your code does with a single call to `urljoin()` it works perfectly. – Ignacio Vazquez-Abrams Mar 23 '15 at 04:06
  • Sweet, thank you. Could you show me how you implemented it? Also, since I thought I had created a handler for this and turned out to be wrong: is the code working the way I wrote it to work? Meaning, is every scraped href URL visited recursively? How would I test that myself? Thanks for your time – Mr.Free Mar 23 '15 at 04:27
  • I replaced the huge structure with `_list.append(urljoin(self.url, href))`. You'll need to add duplicate detection back in though. – Ignacio Vazquez-Abrams Mar 23 '15 at 04:28
  • Thank you so much. Is there anything else you can see I'm doing wrong, by any chance? Is there an easier approach? – Mr.Free Mar 23 '15 at 04:32
  • I don't see the point in calling the `requestUrl()` method in the main stanza. Also, consider implementing a BFS to handle downloading and parsing pages at URLs. – Ignacio Vazquez-Abrams Mar 23 '15 at 04:41
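
Putting those two comment suggestions together, a rough sketch of what a `urljoin`-based parser plus a BFS crawl loop could look like. This is not the original code: `crawl()` and its `max_pages` cap are illustrative additions so the example doesn't run forever.

from urlparse import urljoin  # Python 3: from urllib.parse import urljoin
from collections import deque
import requests
from bs4 import BeautifulSoup as bs

def hrefParser(page_url):
    # Return absolute, de-duplicated hrefs found on page_url.
    soup = bs(requests.get(page_url).content, 'html.parser')
    found = []
    for a in soup.findAll('a'):
        href = a.get('href')
        if href is None:
            continue  # anchor with no href attribute
        absolute = urljoin(page_url, href)  # resolves '/policy' and friends
        if absolute not in found:
            found.append(absolute)
    return found

def crawl(start_url, max_pages=50):
    # Breadth-first crawl: visit each URL exactly once, queue its links.
    visited = set()
    queue = deque([start_url])
    while queue and len(visited) < max_pages:
        current = queue.popleft()
        if current in visited:
            continue
        visited.add(current)
        try:
            for link in hrefParser(current):
                print link
                if link not in visited:
                    queue.append(link)
        except requests.exceptions.RequestException:
            pass  # skip links that can't be downloaded (mailto:, bad hosts, ...)

if __name__ == '__main__':
    crawl('https://www.google.com')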

0 Answers