I have a .txt file containing the complete URLs of a number of pages, each of which contains a table I want to scrape data from. My code works for a single URL, but when I add a loop and read the URLs in from the .txt file, I get the following error:
raise ValueError, "unknown url type: %s" % self.__original
ValueError: unknown url type: ?
Here's my code:
from urllib2 import urlopen
from bs4 import BeautifulSoup as soup
with open('urls.txt', 'r') as f:
    urls = f.read()
for url in urls:
    uClient = urlopen(url)
    page_html = uClient.read()
    uClient.close()
    page_soup = soup(page_html, "html.parser")
    containers = page_soup.findAll("tr", {"class":"data"})
    for container in containers:
        unform_name = container.findAll("th", {"width":"30%"})
        name = unform_name[0].text.strip()
        unform_delegate = container.findAll("td", {"id":"y000"})
        delegate = unform_delegate[0].text.strip()
        print(name)
        print(delegate)
f.close()
I've checked my .txt file and all the entries look normal. They start with http: and end with .html, and there are no apostrophes or quotes around them. Am I coding the for loop incorrectly?
Using this instead:
with open('urls.txt', 'r') as f:
    for url in f:
        print(url)
I get the following:
??http://www.thegreenpapers.com/PCC/AL-D.html
http://www.thegreenpapers.com/PCC/AL-R.html
http://www.thegreenpapers.com/PCC/AK-D.html
And so forth for 100 lines. Only the first line has question marks. My .txt file contains those URLs, with only the state and party abbreviations changing.
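A guess I have not confirmed: the question marks on the first line might be a byte-order mark (BOM) that a text editor wrote at the start of the file. Below is a minimal sketch of how the file could be read with that in mind. It assumes one URL per line and falls back to UTF-8 when no BOM is found. `detect_bom` and `read_urls` are hypothetical helper names of my own, not part of any library.

```python
import codecs
import io

# Known byte-order marks, longest first so that UTF-32 LE
# (which begins with the UTF-16 LE bytes) is not misidentified.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def detect_bom(raw):
    """Return a codec name matching a leading BOM in the raw bytes, or None."""
    for bom, name in BOMS:
        if raw.startswith(bom):
            return name
    return None


def read_urls(path):
    """Read one URL per line, dropping any BOM, surrounding whitespace and blank lines."""
    with io.open(path, "rb") as f:
        raw = f.read()
    # Assumption: a file without a BOM is treated as UTF-8.
    encoding = detect_bom(raw) or "utf-8"
    text = raw.decode(encoding)
    return [line.strip() for line in text.splitlines() if line.strip()]
```

With something like this, the loop would become `for url in read_urls('urls.txt'):`, so that each iteration sees one whole, stripped line rather than whatever iterating over the result of `f.read()` yields.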