
I keep getting a traceback ending in AttributeError: 'NoneType' object has no attribute 'startswith' when I get to the end of my script. Up to this point I am scraping several different pages and collecting them into one list, which is then used to scrape the final URL for each business page. For each_page I scrape all the 'a' tags off the page, and then I want to search through them and keep only the ones that start with '/401k/'. I could probably do it without adding them to yet another list, because I feel like I have too many already. I was thinking of doing it like this:

for a in soup.findAll('a'):
    href = a.get('href')
    if href.startswith('/401k/'):
        final_url.append(href)
        #Even when I try this I get an error saying that no attribute 

Either way it isn't getting the data and I can't figure out what is going on. Maybe I've been looking at the screen too much.

import requests
from bs4 import BeautifulSoup

url = "https://www.brightscope.com/ratings/"
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
hrefs = []
ratings = []
pages = []
s_names = []
final_url = []

for href in soup.findAll('a'):
    if 'href' in href.attrs:
        hrefs.append(href.attrs['href'])
for good_ratings in hrefs:
    if good_ratings.startswith('/ratings/'):
        ratings.append(url[:-9]+good_ratings)

del ratings[0]
del ratings[27:]

for each_rating in ratings:
    page = requests.get(each_rating)
    soup = BeautifulSoup(page.text, 'html.parser')
    span = soup.find('span', class_='letter-pages')
    if soup.find('span', class_='letter-pages'):
        for a in span.find_all('a'):
            href = a.get('href')
            pages.append('https://www.brightscope.com'+href)
    else:
        pages.append(page.url)
hrefs = []
pages = set(pages)
for each_page in pages:
    page = requests.get(each_page)
    soup = BeautifulSoup(page.text, 'html.parser')
    for a in soup.findAll('a'):
        href = a.get('href')
        s_names.append(href)
    # I am getting a traceback error AttributeError: 'NoneType' object has no attribute 'startswith' starting with the code below.
    for each in s_names:
        if each.startswith('/401k'):
            final_url.append(each)
Kamikaze_goldfish

2 Answers


An a tag can lack an href attribute in HTML5, in which case a.get('href') returns None. That is probably what you're experiencing.
You need to make sure None never gets into the list:

for a in soup.findAll('a'):
    href = a.get('href')
    # a.get returns None when the tag has no href attribute, so skip those
    if href is not None:
        s_names.append(href)

See here for more details: https://www.w3.org/TR/2016/REC-html51-20161101/textlevel-semantics.html#the-a-element

If the a element has no href attribute, then the element represents a placeholder for where a link might otherwise have been placed, if it had been relevant, consisting of just the element’s contents.
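The check above can be folded into the prefix test so that no None ever reaches startswith. This is a minimal sketch, not the asker's actual script: the helper name hrefs_with_prefix and the sample paths are made up for illustration, and plain dicts stand in for parsed a tags because their get method also returns None for a missing key.

```python
def hrefs_with_prefix(anchors, prefix):
    """Return href values starting with `prefix`, skipping anchors without an href."""
    kept = []
    for a in anchors:
        href = a.get('href')  # None when the attribute is missing
        if href is not None and href.startswith(prefix):
            kept.append(href)
    return kept


# Assumption: dicts stand in for BeautifulSoup tags; the paths are hypothetical.
sample_anchors = [
    {'href': '/401k/example-plan/'},
    {'name': 'top'},                  # placeholder anchor with no href
    {'href': '/ratings/a/'},
]

print(hrefs_with_prefix(sample_anchors, '/401k/'))
```

With BeautifulSoup itself, the filtering can also happen at search time: soup.find_all('a', href=True) matches only the anchors that carry an href attribute.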

Hagai
  • This makes more sense! We can't expect every `a` tag to have `href`, so making the script handle the exception is a better option. @Hagai, I would suggest using `try except` instead of `if else`, as it takes less time comparatively. (reference: https://stackoverflow.com/a/7604717/3488550) – SanthoshSolomon Aug 26 '18 at 07:49
  • @SmashGuy how would you use try except cleanly here? nothing throws exception here... – Hagai Aug 26 '18 at 08:03
  • I mentioned a custom exception here. If an `a` tag happens to enter without `href`, isn't that an exception to be thrown? – SanthoshSolomon Aug 26 '18 at 08:05
  • @SmashGuy as far as I know `a.get('href')` will return None, and will not throw an exception. You can raise an exception yourself, but it probably complicates the code unnecessarily. – Hagai Aug 26 '18 at 08:06
  • I did not suggest that for your code. I suggested for @Kamikaze_goldfish code like including .startswith() within `try except` block. – SanthoshSolomon Aug 26 '18 at 08:11

The problem you are facing is that you call the startswith method regardless of whether a value is present. You should first check whether the variable each actually holds a value. Try this:

for each in s_names:
    if each and each.startswith('/401k'):
        final_url.append(each)

The condition first checks that the value is truthy, which rules out None (and the empty string). Only if that check passes does it go on to call startswith.
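For comparison, here is a small sketch of that guard next to the try/except variant raised in the comments on the other answer. The function names and the sample list are hypothetical and chosen only for illustration; on a list of strings and None values, both approaches keep the same entries.

```python
def keep_with_and(values, prefix):
    # `v and ...` short-circuits on None and on '' before startswith is called.
    return [v for v in values if v and v.startswith(prefix)]


def keep_with_try(values, prefix):
    kept = []
    for v in values:
        try:
            if v.startswith(prefix):
                kept.append(v)
        except AttributeError:
            # v was None (or another object without startswith); skip it.
            continue
    return kept


# Hypothetical sample data mixing valid paths, None and an empty string.
sample = ['/401k/plan-a/', None, '', '/ratings/b/', '/401k/plan-b/']

print(keep_with_and(sample, '/401k'))
print(keep_with_try(sample, '/401k'))
```

Note that the prefix '/401k' (without a trailing slash) would also match a path such as '/401kfoo'; using '/401k/' as in the question is stricter.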

Arghya Saha
  • I think the real question is why, in the first place, he's adding `None`s to this list – Hagai Aug 26 '18 at 05:40
  • @Hagai Yes, that is one point. But that logic is completely up to him. He might want to add all the `a` tags. My solution focuses on fixing the error he is getting, not on changing his base logic. More broadly, the OP should know how to fix this kind of issue and should be able to write code that handles such situations. With the approach you suggested he would be able to solve it, but he would get stuck again on some other `'NoneType' object has no attribute` problem introduced by some other logic – Arghya Saha Aug 26 '18 at 05:48
  • @argo Great movie btw - Argo. It looks like this does work well. I am running it right now and it hasn't given me any errors. – Kamikaze_goldfish Aug 26 '18 at 20:58
  • @argo I ran this code and I didn't get a traceback error. I did, however, grab ALL the links even ones that didn't start with `/401k/` – Kamikaze_goldfish Aug 26 '18 at 21:25
  • All the links in final_url or in hrefs? – Arghya Saha Aug 27 '18 at 03:11