
I know this is an ever-present problem when working with Python 2.x; I'm currently working with Python 2.7. The text content that I want to output to a tab-delimited text file is being pulled from a SQL Server 2012 database table whose server collation is set to SQL_Latin1_General_CP1_CI_AS.

The exception I get tends to vary a little, but is essentially: UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 57: ordinal not in range(128)

or UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 308: ordinal not in range(128)

Now here is what I typically turn to, but it still results in an error:

from kitchen.text.converters import getwriter
with open("output.txt", 'a') as myfile:
    #content processing done here
    #title is text pulled directly from database
    #just_text is content pulled from raw html inserted into beautiful soup
    #    and using its .get_text() to just retrieve the text content
    UTF8Writer = getwriter('utf8')
    myfile = UTF8Writer(myfile)
    myfile.write(title + '\t' + just_text)

I have also tried:

# also performed for just_text and still resulting in exceptions
title = title.encode('utf-8')

and

title = title.decode('latin-1')
title = title.encode('utf-8')

and

title = unicode(title, 'latin-1')

I have also replaced the with open() with:

with codecs.open("codingOutput.txt", mode='a', encoding='utf-8') as myfile:

I'm not sure what I'm doing wrong, or forgetting to do. I have also swapped the encode with decode, in case I've been doing the encoding/decoding backwards, with no success.

Any help would be greatly appreciated.

Update

I have added print repr(title) and print repr(just_text), both right after retrieving title from the database and right after performing the .get_text(). Not sure how much this helps, but:

For title I get <type 'str'>; for just_text I get <type 'unicode'>.
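
The checks are along these lines:

# quick sanity checks on the two values (Python 2)
print type(title), repr(title)          # -> <type 'str'>, raw bytes from the database
print type(just_text), repr(just_text)  # -> <type 'unicode'>, from BeautifulSoup's get_text()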

Errors

These are errors I'm getting from the content pulled from the BeautifulSoup Summary() function.

C:\Python27\lib\site-packages\bs4\dammit.py:269: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  if (len(data) >= 4) and (data[:2] == b'\xfe\xff') \
C:\Python27\lib\site-packages\bs4\dammit.py:273: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  elif (len(data) >= 4) and (data[:2] == b'\xff\xfe') \
C:\Python27\lib\site-packages\bs4\dammit.py:277: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  elif data[:3] == b'\xef\xbb\xbf':
C:\Python27\lib\site-packages\bs4\dammit.py:280: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  elif data[:4] == b'\x00\x00\xfe\xff':
C:\Python27\lib\site-packages\bs4\dammit.py:283: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  elif data[:4] == b'\xff\xfe\x00\x00':

ValueError: Expected a bytes object, not a unicode object

The traceback portion is:

File <myfile>, line 39, in <module>
  summary_soup = BeautifulSoup(page_summary)
File "C:\Python27\lib\site-packages\bs4\__init__.py", line 193, in __init__
  self.builder.prepare_markup(markup, from_encoding)):
File "C:\Python27\lib\site-packages\bs4\builder\_lxml.py", line 99, in prepare_markup
  for encoding in detector.encodings:
File "C:\Python27\lib\site-packages\bs4\dammit.py", line 256, in encodings
  self.chardet_encoding = chardet_dammit(self.markup)
File "C:\Python27\lib\site-packages\bs4\dammit.py", line 31, in chardet_dammit
  return chardet.detect(s)['encoding']
File "C:\Python27\lib\site-packages\chardet\__init__.py", line 25, in detect
  raise ValueError('Expected a bytes object, not a unicode object')
ValueError: Expected a bytes object, not a unicode object
shadonar
  • Does [this](http://stackoverflow.com/a/22405042/1903116) help? – thefourtheye Apr 08 '15 at 14:12
  • @thefourtheye i have just tried that and I receive `UnicodeDecodeError: 'utf16' codec can't decode byte 0x20 in position 1652: truncated data` on content that was previously working. – shadonar Apr 08 '15 at 14:20

2 Answers


Here's some advice. Everything has an encoding. Your issue is just a matter of finding out the various encodings of the different portions, re-encoding them into a common format, and writing the result to a file.

I recommend choosing utf-8 as the output encoding.

f = open('output', 'w')
unistr = title.decode("latin-1") + "\t" + just_text
f.write(unistr.encode("utf-8"))

Beautiful Soup's get_text() returns Python's unicode type. decode("latin-1") gets your database content into the unicode type as well; the two are joined with the tab, and the result is encoded to utf-8 bytes before writing.
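
Applied to the question's append-mode loop, the flow looks roughly like this (a sketch, assuming title comes from the database as latin-1 bytes and just_text is already unicode from get_text()):

with open("output.txt", 'a') as myfile:
    title_u = title.decode("latin-1")     # database bytes -> unicode
    line = title_u + u"\t" + just_text    # unicode + unicode, no implicit ascii decode
    myfile.write(line.encode("utf-8"))    # encode once, at the output boundary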

cdosborn
  • I think this is working. I'm running it right now and the first few that were giving me problems went through successfully. If this continues to work for the entire set of things I need, then I think my problem was doing too much (overthinking the problem), and not starting with the `decode('latin-1')` and then writing it as utf-8 to the file. – shadonar Apr 09 '15 at 17:55
  • I just ran into these errors, I don't know if it's related or if this would actually be a different question. look at the `Errors` section of the question. – shadonar Apr 09 '15 at 18:04
  • That does appear to have fixed the problem!! Yes! Thank you very much! – shadonar Apr 09 '15 at 19:07

The issue is that you mix bytes and Unicode text:

>>> u'\xe9'.encode('utf-8') + '\t' + u'x'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

where u'\xe9'.encode('utf-8') is a bytestring that encodes the é character (U+00E9) using the utf-8 encoding, and u'x' is Unicode text containing the x character (U+0078).
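
If you're not sure which values are byte strings and which are already Unicode, a small coercion helper (hypothetical, assuming latin-1 for the byte strings) makes the boundary explicit:

def ensure_unicode(value, encoding='latin-1'):
    # Python 2: decode byte strings, pass unicode through unchanged
    if isinstance(value, str):
        return value.decode(encoding)
    return value

line = ensure_unicode(title) + u'\t' + ensure_unicode(just_text)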

The solution is to use Unicode text:

>>> print u'\xe9' + '\t' + u'x'
é       x

BeautifulSoup accepts Unicode input:

>>> import bs4
>>> bs4.BeautifulSoup(u'\xe9' + '\t' + u'x')
<html><body><p>é        x</p></body></html>
>>> bs4.__version__
'4.2.1'

Avoid unnecessary conversions to/from Unicode. Decode input data into Unicode once, use Unicode everywhere to represent text in your program, and encode the output to bytes at the end (if necessary):

with open('output.html', 'wb') as file:
    file.write(soup.prettify('utf-8'))
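
The same idea applies to the question's tab-delimited file; one option (a sketch, assuming title and just_text are both already unicode) is to let io.open do the encoding on write:

import io

with io.open('output.txt', 'a', encoding='utf-8') as myfile:
    # only unicode crosses this boundary; io.open handles the utf-8 encoding
    myfile.write(title + u'\t' + just_text + u'\n')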
jfs