I wrote a script that downloads a few pages from a server and parses them with BeautifulSoup. I am writing the output to a .csv file. I am using Python 2.7.2.

I get the following error at some point:

Traceback (most recent call last):
  File "parser.py", line 114, in <module>
    c.writerow([title,description,price,weight,category,subcategory])
UnicodeEncodeError: 'ascii' codec can't encode character u'\xb7' in position 61: ordinal not in range(128)

The page I am downloading from (I checked the exact page) doesn't seem to have any weird characters.

I tried some of the solutions from the similar questions. I tried decoding like this:

content.decode('utf-8','ignore')

but it did not work.

As suggested in "Python and BeautifulSoup encoding issues", I checked the website source, and it doesn't specify any encoding metadata either. I also tried chardet, as suggested in "How to download any(!) webpage with correct charset in python?", but the urlread() method doesn't seem to work. I tried urlopen() instead, and it crashed.

How can I proceed with this?

Dynelight

2 Answers

BeautifulSoup gives you unicode, so to write this to the file you need to encode the data:

content.encode('utf8')

Do this before passing the data to the csv writer's .writerow() method. There is no need to add 'ignore' here because UTF-8 can encode all of Unicode. Your full line could be:

c.writerow([e.encode('utf8') for e in (title, description, price, weight, category, subcategory)])

using a list comprehension to encode each element in turn.
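
If some columns might already be byte strings, a slightly more defensive variant could encode only the text values. This is a minimal sketch, not part of the original script: encode_row is a hypothetical helper name and the sample values are made up.

```python
# -*- coding: utf-8 -*-
# type(u"") is `unicode` on Python 2 and `str` on Python 3.
TEXT_TYPE = type(u"")


def encode_row(values, encoding="utf-8"):
    """Encode every text value to bytes; leave all other values untouched.

    encode_row is a hypothetical helper, not something the csv module provides.
    """
    return [v.encode(encoding) if isinstance(v, TEXT_TYPE) else v
            for v in values]


# Made-up sample row: two text values and one value that is already bytes.
row = encode_row([u"Widget \xb7 large", u"4.50", b"already bytes"])
```

On Python 2.7 the result could then be passed straight to the writer, for example c.writerow(encode_row([title, description, price, weight, category, subcategory])).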

If you need to manipulate the strings before writing them, convert the NavigableString objects to unicode objects first:

unicode(description)

Alternatively, instead of encoding each column value, use the UnicodeWriter class included in the csv module examples section to have your data encoded automatically.

HTML can often use characters like em-dashes or non-breaking spaces that are not encodable to ASCII, and you won't pick those out with a quick visual scan of the page.
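
Since such characters are easy to miss by eye, one way to locate them is to scan the text for code points above 127. The following is a rough sketch; find_non_ascii is a hypothetical helper and the sample string is invented for illustration.

```python
def find_non_ascii(text):
    """Return (position, code point label) pairs for each non-ASCII character."""
    return [(i, "U+%04X" % ord(ch))
            for i, ch in enumerate(text)
            if ord(ch) > 127]


# Invented sample containing a non-breaking space, an em-dash and a middle dot.
sample = u"Net weight\xa0500g \u2014 organic \xb7 fair trade"
hits = find_non_ascii(sample)
```

Running this on the text extracted from the page should point at whatever character sits at the position named in the traceback.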

Martijn Pieters
  • Hmm. That didn't seem to do it. I get the message Traceback (most recent call last): File "parser.py", line 94, in <module> description = description.encode('utf8') UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 529: ordinal not in range(128) right from the start. I am adding the encoding like this: description_aux = page.find(...); description_aux = str(description_aux); description = description_aux.replace('\n', '').replace('\r', ''); description = description.encode('utf8') – Dynelight Apr 27 '13 at 10:01
  • I tried with your new edit as well: Traceback (most recent call last): File "parser.py", line 119, in <module> c.writerow([e.encode('utf8') for e in (title, description, price, weight, category, subcategory)]) UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 529: ordinal not in range(128) – Dynelight Apr 27 '13 at 10:06
  • @meOnomNom: so description is *already* encoded at that point. `str()` *encodes* the string, it does the same thing as `.encode()` but with the default codec. – Martijn Pieters Apr 27 '13 at 10:24
  • The problem is that I need to use str(...) in order to use the .replace() functions. Unless you have any other ideas? =( – Dynelight Apr 27 '13 at 10:31
  • That worked! Should I edit the question to say that I was using str() too? – Dynelight Apr 27 '13 at 10:36
  • If you want to; it does show that giving us proper tracebacks and the code that originated that traceback is important. :-) – Martijn Pieters Apr 27 '13 at 10:41
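
The double-encoding problem from the comments above can be sketched as follows. In Python 2, calling .encode() on a byte string first decodes it with the ASCII codec, which is the step that fails. The sample text is made up, and the sketch assumes the bytes produced by str() were UTF-8, which the 0xe2 byte in the traceback is consistent with.

```python
# Made-up description containing an em-dash and a trailing line break.
description = u"Hand-made \u2014 ships in 2 days\r\n"

# Assumption: str() on the parsed element yields UTF-8 encoded bytes.
as_bytes = description.encode("utf-8")

# Python 2 decodes a byte string as ASCII before it can re-encode it.
# Doing that step explicitly reproduces the UnicodeDecodeError.
try:
    as_bytes.decode("ascii")
    error = None
except UnicodeDecodeError as exc:
    error = exc

# Fix: keep the value as text, clean it up, and encode exactly once.
cleaned = description.replace(u"\n", u"").replace(u"\r", u"")
encoded = cleaned.encode("utf-8")
```

The .replace() calls work on text just as they do on byte strings, so str() is not needed for the cleanup step.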

It seems like the contents of the page have been successfully parsed into a unicode object, but that the CSV writer is implicitly trying to convert back to str and therefore throwing the above error. As UTF-8 will work for any character, you can hopefully use the following:

c.writerow([title.encode("UTF-8"),description.encode("UTF-8"),price.encode("UTF-8"),weight.encode("UTF-8"),category.encode("UTF-8"),subcategory.encode("UTF-8")])

If that doesn't work, then you could try to debug it further by finding out exactly what format the data is in at that point. You can do this by writing the string representations of each variable to the CSV file, rather than the strings themselves, as follows:

c.writerow([repr(title),repr(description),repr(price),repr(weight),repr(category),repr(subcategory)])

Then you can look in the CSV file, and you might see rows like:

"abc","def",u"\u00A0123","456","abc","def"

You can then paste any tricky-looking strings (such as u"\u00A0123") into a Python interpreter and play around with them directly, trying different ways of encoding and decoding.
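
As an illustration of that kind of experiment, here is a short sketch using the example string above. The particular codecs tried are only suggestions.

```python
# The example string from the CSV output above: a non-breaking space plus "123".
sample = u"\u00A0123"

# UTF-8 can represent every character, so this never fails.
utf8_bytes = sample.encode("utf-8")

# Latin-1 covers U+00A0 as a single byte.
latin1_bytes = sample.encode("latin-1")

# ASCII cannot represent U+00A0; 'replace' substitutes a question mark.
lossy = sample.encode("ascii", "replace")

# Decoding with the codec used for encoding restores the original text.
round_trip = utf8_bytes.decode("utf-8")
```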

Joe Sarre
  • That didn't work. I also tried that. I got the following message: UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 529: ordinal not in range(128) – Dynelight Apr 27 '13 at 09:56
  • @meOnomNom: What part did you try? Are you doing any concatenation? What is the traceback of the error? – Martijn Pieters Apr 27 '13 at 09:57
  • Traceback (most recent call last): File "parser.py", line 118, in <module> c.writerow([title,description.encode("UTF-8"),price,weight,category,subcategory]) UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 529: ordinal not in range(128) – Dynelight Apr 27 '13 at 10:04
  • Your method does work, but my problem was that I was using a str() function too, so the data was being encoded twice. – Dynelight Apr 27 '13 at 10:37