
EDIT: I appreciate that my question may have been answered previously, though I'm not sure that it was answered in the linked post. Regardless, what I'm trying to say is that I'm so new to this that even if the answer is there, I don't quite understand how to apply it to my situation. If anyone can help make it a little more explicit, I'd appreciate it.

I've got literally less than 24 hours of Python experience, so I hope you'll overlook my inability to apply advice from other threads to my own question.

I'm using BeautifulSoup to pull down a table, but I'm getting weird results. I'm inputting the following:

# -*- coding: utf-8 -*-
#Fetch table

import sys
import requests
from bs4 import BeautifulSoup

r = requests.get('http://www.nrc.gov/reactors/operating/list-power-reactor-units.html')

html_soup = BeautifulSoup(r.text)

table = html_soup.find('table')
for row in table.find_all('tr')[1:]:
    col = row.find_all('td')
    print col[0].find('a').get('href') # link
    print col[0].find('a').contents[0] # name
    print col[1].string
    print col[2].string
    print col[3].string
    print col[4].string

I'm almost certain I've watched someone else use the exact same code without a problem, but I'm getting crap. When I run it from the command prompt, it prints the majority of the table data correctly until it reaches an en dash about two-thirds of the way through, at which point it gives me this:

Traceback (most recent call last):
  File "C:\Python27\test\scrape.py", line 18, in <module>
    print col[3].string
  File "C:\Python27\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 31-32: character maps to <undefined>
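
A minimal sketch of what this first traceback seems to be saying, assuming the console really is using cp437 (as the `cp437.py` frame suggests): that code page has no en dash, so encoding the text for printing fails. The `sample` string and the `to_console_safe` helper are made up for illustration; they are not part of my script or of any library.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function


def to_console_safe(text, encoding='cp437'):
    """Round-trip text through the console encoding.

    Characters the encoding cannot represent are replaced with '?'
    instead of raising UnicodeEncodeError.
    """
    return text.encode(encoding, 'replace').decode(encoding)


# Hypothetical cell text containing an en dash (U+2013).
sample = u'2009\u20132029'

try:
    sample.encode('cp437')
except UnicodeEncodeError as exc:
    # exc.object is the original text; start/end mark the offending span.
    print('cp437 cannot encode:', repr(exc.object[exc.start:exc.end]))

print(to_console_safe(sample))
```

If that assumption holds, wrapping each printed value this way would trade the crash for a `?` in place of the dash, which is lossy but at least lets the loop finish.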

I thought I could just run from IDLE to avoid the character limitations in the command prompt, but I just get a different error. There, it only grabs the first link, and then it spits out this:

Traceback (most recent call last):
  File "C:\Python27\test\scrape.py", line 15, in <module>
    print col[0].find('a').contents[0] # name
  File "C:\Python27\lib\idlelib\PyShell.py", line 1344, in write
    s = unicode.__getslice__(s, None, None)
TypeError: an integer is required

I've also tried using BS4 to replace the en dashes with hyphens before spitting them out, but I can't even get that replacement to work in a totally stripped-down command. When I use:

# -*- coding: cp1252 -*-
from bs4 import BeautifulSoup

soup = BeautifulSoup("–")
soup.find(text="–").replaceWith("--")
print soup

I get this:

Traceback (most recent call last):
  File "C:\Python27\test\replace.py", line 5, in <module>
    soup.find(text="–").replaceWith("--")
  File "C:\Python27\lib\site-packages\bs4\element.py", line 1159, in find
    l = self.find_all(name, attrs, recursive, text, 1, **kwargs)
  File "C:\Python27\lib\site-packages\bs4\element.py", line 1180, in find_all
    return self._find_all(name, attrs, text, limit, generator, **kwargs)
  File "C:\Python27\lib\site-packages\bs4\element.py", line 484, in _find_all
    strainer = SoupStrainer(name, attrs, text, **kwargs)
  File "C:\Python27\lib\site-packages\bs4\element.py", line 1446, in __init__
    self.text = self._normalize_search_value(text)
  File "C:\Python27\lib\site-packages\bs4\element.py", line 1457, in _normalize_search_value
    return value.decode("utf8")
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x96 in position 0: invalid start byte

That error at least comes back the same from both the command line and IDLE, but I'm not really sure whether that's better or worse.
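
As far as I can piece together, the last frames of that traceback show bs4 calling `value.decode("utf8")` on the search text. In a file saved as cp1252, a plain `"–"` literal under Python 2 is the single byte 0x96, which is not valid UTF-8. The sketch below only reproduces that byte-level mismatch; `decodes_as_utf8` is a hypothetical helper, and it does not touch BeautifulSoup at all.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

EN_DASH = u'\u2013'

# The same character as bytes in the two encodings involved.
cp1252_bytes = EN_DASH.encode('cp1252')  # the single byte 0x96
utf8_bytes = EN_DASH.encode('utf-8')     # three bytes: 0xE2 0x80 0x93


def decodes_as_utf8(raw):
    """Return True if the byte string is valid UTF-8, False otherwise."""
    try:
        raw.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False


print(repr(cp1252_bytes), decodes_as_utf8(cp1252_bytes))
print(repr(utf8_bytes), decodes_as_utf8(utf8_bytes))
```

If this reading is right, passing a unicode literal (`u"–"`) instead of a byte string would likely sidestep the decode step, since there would be nothing left to decode, but I have not confirmed that against bs4 itself.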

I've been screwing around with this for a few hours, and although I've found several threads where people are talking about UnicodeEncode/DecodeErrors, I don't really have a strong enough grasp of the language yet to figure out how to apply those answers to this.

Thanks for the help.

bdb484
  • So I saw this post that is being marked as a duplicate, but it hasn't gotten me where I need to be. It says "Setting the PYTHONIOENCODING environment variable as described above can be used to suppress the error messages," but as far as I can tell, it doesn't actually describe how to set that variable. – bdb484 Jul 12 '14 at 22:20
  • The advice in the "Wrapping sys.stdout into an instance of StreamWriter" section, which the answer says to use, also doesn't seem to work. At least when I try it, it prints the wrong characters, e.g. `û` for `u'\u2013'`, which should be an en-dash. I'm looking into the other answers, and I might undupe this question. – user2357112 Jul 12 '14 at 22:42
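
One possible explanation for the `û` mentioned in the comment above, offered as a guess rather than a diagnosis: if the en dash is written out as cp1252 bytes and the console then interprets those bytes as cp437, byte 0x96 is displayed as `û`. The `shown_in_console` helper below is hypothetical and only simulates that mismatch.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

EN_DASH = u'\u2013'


def shown_in_console(text, written_as, console_encoding='cp437'):
    """Simulate text encoded one way being displayed by a console
    that interprets the bytes using a different code page."""
    return text.encode(written_as).decode(console_encoding)


# cp1252 encodes the en dash as 0x96, which cp437 renders as u-circumflex.
print(repr(shown_in_console(EN_DASH, 'cp1252')))

# UTF-8 encodes it as three bytes, so cp437 would show three characters.
print(len(shown_in_console(EN_DASH, 'utf-8')))
```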
