EDIT: I appreciate that my question may have been answered previously, though I'm not sure that it was answered in the linked post. Regardless, what I'm trying to say is that I'm so new to this that even if the answer is there, I don't quite understand how to apply it to my situation. If anyone can help make it a little more explicit, I'd appreciate it.
I've got literally less than 24 hours of Python experience, so I hope you'll overlook my inability to apply advice from other threads to my own question.
I'm using BeautifulSoup to pull down a table, but I'm getting weird results. I'm running the following script (Python 2.7 on Windows):
# -*- coding: utf-8 -*-
#Fetch table
import sys
import requests
from bs4 import BeautifulSoup
r = requests.get('http://www.nrc.gov/reactors/operating/list-power-reactor-units.html')
html_soup = BeautifulSoup(r.text)
table = html_soup.find('table')
for row in table.find_all('tr')[1:]:
    col = row.find_all('td')
    print col[0].find('a').get('href') # link
    print col[0].find('a').contents[0] # name
    print col[1].string
    print col[2].string
    print col[3].string
    print col[4].string
I'm almost certain I've watched someone else use the exact same code without a problem, but it fails for me. When I run it from the command prompt, it prints most of the table data correctly until it reaches an en dash about two-thirds of the way through, at which point it gives me this:
Traceback (most recent call last):
  File "C:\Python27\test\scrape.py", line 18, in <module>
    print col[3].string
  File "C:\Python27\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 31-32: character maps to <undefined>
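As far as I can tell, the same kind of failure can be reproduced without BeautifulSoup or the website at all. This is only a minimal sketch: it assumes the console encoding is cp437 (which is what the traceback suggests), and the sample string is made up rather than taken from the real table.

```python
# -*- coding: utf-8 -*-
# Hypothetical sample text; the real cell contents are not shown here.
sample = u"Unit 1 \u2013 Unit 2"  # \u2013 is an en dash

# Strict encoding (the default) raises, because cp437 has no en dash.
try:
    sample.encode("cp437")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# With errors="replace", unencodable characters become "?" instead of raising.
replaced = sample.encode("cp437", "replace")

print(strict_failed)
print(replaced)
```

If that reasoning is right, the problem is in turning the text into bytes for the console, not in the scraping itself.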
I thought I could just run from IDLE to avoid the character limitations in the command prompt, but I just get a different error. There, it only grabs the first link, and then it spits out this:
Traceback (most recent call last):
  File "C:\Python27\test\scrape.py", line 15, in <module>
    print col[0].find('a').contents[0] # name
  File "C:\Python27\lib\idlelib\PyShell.py", line 1344, in write
    s = unicode.__getslice__(s, None, None)
TypeError: an integer is required
I've also tried using BS4 to replace the en dashes with hyphens before printing them, but I can't even get that replacement to work in a totally stripped-down script. When I use:
# -*- coding: cp1252 -*-
from bs4 import BeautifulSoup
soup = BeautifulSoup("–")
soup.find(text="–").replaceWith("--")
print soup
I get this:
Traceback (most recent call last):
  File "C:\Python27\test\replace.py", line 5, in <module>
    soup.find(text="–").replaceWith("--")
  File "C:\Python27\lib\site-packages\bs4\element.py", line 1159, in find
    l = self.find_all(name, attrs, recursive, text, 1, **kwargs)
  File "C:\Python27\lib\site-packages\bs4\element.py", line 1180, in find_all
    return self._find_all(name, attrs, text, limit, generator, **kwargs)
  File "C:\Python27\lib\site-packages\bs4\element.py", line 484, in _find_all
    strainer = SoupStrainer(name, attrs, text, **kwargs)
  File "C:\Python27\lib\site-packages\bs4\element.py", line 1446, in __init__
    self.text = self._normalize_search_value(text)
  File "C:\Python27\lib\site-packages\bs4\element.py", line 1457, in _normalize_search_value
    return value.decode("utf8")
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x96 in position 0: invalid start byte
That error at least comes back the same from both the command line and IDLE, but I'm not really sure whether that's better or worse.
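For reference, the byte 0x96 in that last traceback is how cp1252 stores an en dash, and on its own it is not valid UTF-8. The sketch below only illustrates that mismatch with a hand-written byte string; it is not a claim about what BeautifulSoup does internally beyond the `decode("utf8")` call visible in the traceback.

```python
# -*- coding: utf-8 -*-
# The single byte that a cp1252-encoded source file uses for an en dash.
raw = b"\x96"

# Decoding it as UTF-8 fails, matching the "invalid start byte" message.
try:
    raw.decode("utf8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Decoding it with the encoding it was written in gives the en dash back.
decoded = raw.decode("cp1252")

# Once the text is unicode, a plain string replace swaps in two hyphens.
as_hyphens = decoded.replace(u"\u2013", u"--")

print(utf8_failed)
print(repr(decoded))
print(as_hyphens)
```

So my guess is that the search string needs to be a unicode object (or decoded from cp1252) before it reaches `find`, but I don't know the right way to do that.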
I've been screwing around with this for a few hours, and although I've found several threads where people are talking about UnicodeEncode/DecodeErrors, I don't really have a strong enough grasp of the language yet to figure out how to apply those answers to this.
Thanks for the help.