from bs4 import BeautifulSoup
import codecs
import sys

import urllib.request
site_response= urllib.request.urlopen("http://site/")
html=site_response.read()
file = open ("cars.html","wb") #open file in binary mode
file.write(html)
file.close()


soup = BeautifulSoup(html, "html.parser") #parse the downloaded bytes; name the parser explicitly
output = (soup.prettify('latin'))
#print(output) #prints whole file for testing

file_output = open ("cars_out.txt","wb")
file_output.write(output)
file_output.close()

fulllist=soup.find_all("div", class_="row vehicle")
#print(fulllist) #prints each row vehicle class for debug

for item in fulllist:
    item_print=item.find("span", class_="modelYearSort").string
    item_print=item_print + "|" + item.find("span", class_="mmtSort").string
    seller_phone=item.find("span", class_="seller-phone")
    print(seller_phone)
    # item_print=item_print + "|" + item.find("span", class_="seller-phone").string
    item_print=item_print + "|" + item.find("span", class_="priceSort").string
    item_print=item_print + "|" + item.find("span", class_="milesSort").string
    print(item_print)

I have the code above. It parses some HTML and generates a pipe-delimited file. It works fine, except that in a few entries one of the elements (seller-phone) is missing from the HTML. Not all entries have a seller phone number.

item.find("span", class_="seller-phone").string

I get a failure here. I am not surprised that the line fails when seller-phone is missing. The error is AttributeError: 'NoneType' object has no attribute 'string'.

I am able to call item.find without the .string and get back the full block of HTML, but I cannot figure out how to extract the text in those cases.

personalt

1 Answer

You're correct: find returns None when no matching element is found, so accessing .string on the result raises AttributeError.

You can add an if/else check to avoid this:

for item in fulllist:
    span = item.find("span", class_="modelYearSort")
    if span:
        item_print = span.string
        item_print=item_print + "|" + item.find("span", class_="mmtSort").string
        seller_phone=item.find("span", class_="seller-phone")
        print(seller_phone)
        # item_print=item_print + "|" + item.find("span", class_="seller-phone").string
        item_print=item_print + "|" + item.find("span", class_="priceSort").string
        item_print=item_print + "|" + item.find("span", class_="milesSort").string
        print(item_print)
    else:
        continue #It's empty, go on to the next loop.

Or, if you prefer, use a try/except block:

for item in fulllist:
    try:
        item_print=item.find("span", class_="modelYearSort").string
    except AttributeError:
        continue #skip to the next loop.
    else:
        item_print=item_print + "|" + item.find("span", class_="mmtSort").string
        seller_phone=item.find("span", class_="seller-phone")
        print(seller_phone)
        # item_print=item_print + "|" + item.find("span", class_="seller-phone").string
        item_print=item_print + "|" + item.find("span", class_="priceSort").string
        item_print=item_print + "|" + item.find("span", class_="milesSort").string
        print(item_print)
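
If the goal is to keep the row and leave the missing field empty (so the output has || in that position) rather than skip the item, one possible sketch is a small helper that returns an empty string when find comes back with None. The names span_text, format_row, FIELDS and demo are made up for illustration, and the sample HTML in demo is invented to mirror the class names from the question:

```python
# Hypothetical helper names; only item.find(...) and .string come from the code above.
FIELDS = ("modelYearSort", "mmtSort", "seller-phone", "priceSort", "milesSort")


def span_text(item, css_class, default=""):
    """Return the text of the first matching span, or default if it is missing."""
    span = item.find("span", class_=css_class)
    # find() returns None when nothing matches; .string can also be None
    # when the span contains nested tags instead of a single text node.
    if span is None or span.string is None:
        return default
    return span.string.strip()


def format_row(item):
    """Join all fields with '|', leaving an empty slot for any missing span."""
    return "|".join(span_text(item, name) for name in FIELDS)


def demo():
    # Imported here so the helpers above do not depend on bs4 themselves.
    from bs4 import BeautifulSoup

    # Invented sample markup: the second vehicle has no seller-phone span.
    html = (
        '<div class="row vehicle">'
        '<span class="modelYearSort">2010</span>'
        '<span class="mmtSort">Honda Civic</span>'
        '<span class="seller-phone">555-0100</span>'
        '<span class="priceSort">$9,500</span>'
        '<span class="milesSort">60,000</span></div>'
        '<div class="row vehicle">'
        '<span class="modelYearSort">2012</span>'
        '<span class="mmtSort">Ford Focus</span>'
        '<span class="priceSort">$8,000</span>'
        '<span class="milesSort">45,000</span></div>'
    )
    soup = BeautifulSoup(html, "html.parser")
    return [format_row(item) for item in soup.find_all("div", class_="row vehicle")]
```

Calling demo() and printing each returned string should give one pipe-delimited line per vehicle. Because the check lives in one place, every field gets the same treatment, not just seller-phone.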

Hope this helps!

aIKid
  • Thanks, this is helpful... I guess I wasn't that clear on what I wanted to do if the phone number wasn't present. I actually don't want to skip to the next item; I just want to treat it as a null so my string has || in that location. However, I think I can leverage what you provided above to do that, as the error handling part is where I was getting stuck. I will give it a try in a bit – personalt Dec 07 '13 at 14:20
  • Maybe just try `item_print = item.find('span', class_='modelYearSort', text=True)` instead... see if that works - that should only return the nodes that have non-empty strings to start with – Jon Clements Dec 07 '13 at 14:23
  • @Jon Mmm.. I thought the problem was because BS cannot find the span? – aIKid Dec 07 '13 at 14:31
  • @aIKid *sighs* yeah... I think I'll have another mug of coffee :) – Jon Clements Dec 07 '13 at 14:33
  • I can see the problem with find, but it is even worse with find_all() in a list comprehension because it crashes. I cannot see a way of trapping that with an except clause or even inside the comprehension. The problem came up when the developer forgot to label the last column in a table, e.g. `[th.get_text() for th in table.find("tr").find_all("th")]`, where the find_all chokes on the blank name. – user3451435 Nov 18 '16 at 02:18
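
For the list-comprehension case in the last comment, one way to trap the failure is to resolve the find step before the comprehension runs. This is only a sketch, assuming the crash comes from find returning None (a missing table or row) rather than from the empty header cell itself. header_texts and demo are hypothetical names, and the sample table is invented:

```python
def header_texts(table):
    """Return the text of each <th> in the table's first row, or [] if absent."""
    if table is None:
        return []
    first_row = table.find("tr")
    if first_row is None:
        # No row to iterate over, so there is nothing for find_all to run on.
        return []
    # An unlabeled header cell simply yields an empty string here.
    return [th.get_text(strip=True) for th in first_row.find_all("th")]


def demo():
    # Imported here so header_texts itself does not depend on bs4.
    from bs4 import BeautifulSoup

    # Invented sample: the last column header was left blank.
    html = "<table><tr><th>Year</th><th>Model</th><th></th></tr></table>"
    soup = BeautifulSoup(html, "html.parser")
    found = header_texts(soup.find("table"))
    missing = header_texts(soup.find("table", id="missing"))
    return found, missing
```

Keeping the None checks outside the comprehension means no except clause is needed, and the caller always gets a list back.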