from bs4 import BeautifulSoup
import codecs
import sys
import urllib.request
site_response= urllib.request.urlopen("http://site/")
html=site_response.read()
file = open ("cars.html","wb") #open file in binary mode
file.write(html)
file.close()
soup = BeautifulSoup(open("cars.html"))
output = (soup.prettify('latin'))
#print(output) #prints whole file for testing
file_output = open ("cars_out.txt","wb")
file_output.write(output)
file_output.close()
fulllist=soup.find_all("div", class_="row vehicle")
#print(fulllist) #prints each row vehicle class for debug
for item in fulllist:
    item_print=item.find("span", class_="modelYearSort").string
    item_print=item_print + "|" + item.find("span", class_="mmtSort").string
    seller_phone=item.find("span", class_="seller-phone")
    print(seller_phone)
    # item_print=item_print + "|" + item.find("span", class_="seller-phone").string
    item_print=item_print + "|" + item.find("span", class_="priceSort").string
    item_print=item_print + "|" + item.find("span", class_="milesSort").string
    print(item_print)
I have the code above. It parses some HTML and generates a pipe-delimited file. It works fine, except that in a few entries one of the elements (seller-phone) is missing from the HTML. Not all entries have a seller phone number.
item.find("span", class_="seller-phone").string
I get a failure here. I am not surprised that the line fails when seller-phone is missing. The error is: AttributeError: 'NoneType' object has no attribute 'string'.
I am able to call item.find without the .string and get back the full block of HTML, but I cannot figure out how to extract the text for those cases.
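
One possible way to handle this, as a minimal sketch rather than a definitive fix: check whether find() returned None before reading the text, and fall back to an empty field when the span is absent. The sample markup, the FIELDS list, and the helper names field_text and build_rows are hypothetical and only for illustration. The sketch also assumes that get_text() is an acceptable replacement for .string, and it passes "html.parser" explicitly instead of relying on the default parser.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: the second listing has no seller-phone span.
SAMPLE_HTML = """
<div class="row vehicle">
  <span class="modelYearSort">2012</span>
  <span class="mmtSort">Ford Focus</span>
  <span class="seller-phone">555-0100</span>
  <span class="priceSort">9500</span>
  <span class="milesSort">42000</span>
</div>
<div class="row vehicle">
  <span class="modelYearSort">2010</span>
  <span class="mmtSort">Honda Civic</span>
  <span class="priceSort">7800</span>
  <span class="milesSort">61000</span>
</div>
"""

# Class names taken from the code above, in output order.
FIELDS = ["modelYearSort", "mmtSort", "seller-phone", "priceSort", "milesSort"]


def field_text(item, css_class, default=""):
    """Return the stripped text of the first matching span, or default if absent."""
    tag = item.find("span", class_=css_class)
    if tag is None:
        # find() returns None when nothing matches, so there is no text to read.
        return default
    return tag.get_text(strip=True)


def build_rows(html):
    """Build one pipe-delimited string per 'row vehicle' div."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.find_all("div", class_="row vehicle"):
        rows.append("|".join(field_text(item, name) for name in FIELDS))
    return rows


rows = build_rows(SAMPLE_HTML)
for row in rows:
    print(row)
```

With this approach, an entry without a phone number keeps its column position as an empty field between two pipes, so every row still has the same number of columns.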