I know this is an ever-present problem when working with Python 2.x. I'm currently working with Python 2.7. The text content that I want to write to a tab-delimited text file is pulled from a SQL Server 2012 database table whose server collation is set to SQL_Latin1_General_CP1_CI_AS.
The exception I get varies a little, but it is essentially:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 57: ordinal not in range(128)
or
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 308: ordinal not in range(128)
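For reference, the first message can be reproduced in isolation. This is only a minimal sketch: the byte string is a hypothetical stand-in for a database value (57 ASCII bytes followed by 0xa0), and the explicit .decode("ascii") call spells out what Python 2 appears to do implicitly whenever a str is combined with a unicode object. The cp1252 decode at the end is an assumption about the data, not something confirmed for the real table.

```python
# Hypothetical stand-in for a database value: 57 ASCII bytes, then byte 0xa0
# (a non-breaking space in cp1252 / latin-1).
title = b"x" * 57 + b"\xa0"

try:
    # Python 2 performs this kind of decode implicitly when a str is mixed
    # with a unicode object; here it is written out explicitly.
    title.decode("ascii")
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)

# Decoding with an encoding that actually covers byte 0xa0 succeeds.
# cp1252 is an assumption about how the bytes were produced.
decoded = title.decode("cp1252")
```

If the real data behaves the same way, the "position 57" in the message would simply be the offset of the first non-ASCII byte in the string being decoded.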
Here is what I typically turn to, but it still results in an error:
from kitchen.text.converters import getwriter

with open("output.txt", 'a') as myfile:
    # content processing done here
    # title is text pulled directly from the database
    # just_text is content pulled from raw HTML inserted into Beautiful Soup,
    # using its .get_text() to retrieve just the text content
    UTF8Writer = getwriter('utf8')
    myfile = UTF8Writer(myfile)
    myfile.write(title + '\t' + just_text)
I have also tried:
    # also performed for just_text, still resulting in exceptions
    title = title.encode('utf-8')
and
    title = title.decode('latin-1')
    title = title.encode('utf-8')
and
    title = unicode(title, 'latin-1')
I have also replaced the with open() line with:
    with codecs.open("codingOutput.txt", mode='a', encoding='utf-8') as myfile:
I'm not sure what I'm doing wrong or forgetting to do. I have also swapped encode with decode, in case I had the encoding/decoding backwards, with no success.
Any help would be greatly appreciated.
Update
I have added print repr(title) and print repr(just_text), both right after retrieving title from the database and after calling .get_text(). Not sure how much this helps, but:
for title I get: <type 'str'>
for just_text I get: <type 'unicode'>
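Given those two types, one thing that may be worth trying is to decode the byte string once, keep everything as text, and let the file object do the encoding. The sketch below is written under assumptions: to_text is a hypothetical helper, the sample values are made up, and cp1252 is assumed because SQL_Latin1_General_CP1 collations use code page 1252 for non-Unicode columns (whether the driver really hands back raw cp1252 bytes is not confirmed). It writes to a temporary directory so it can be run without touching real output.

```python
import io
import os
import tempfile


def to_text(value, encoding="cp1252"):
    """Decode byte strings to text; pass text through unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Hypothetical sample values standing in for the real data.
title = b"Price:\xa0100"            # bytes, like the <type 'str'> title
just_text = u"It\u2019s a summary"  # text, like the <type 'unicode'> just_text

# Everything is text before concatenation, so no implicit ASCII decode occurs.
line = to_text(title) + u"\t" + to_text(just_text) + u"\n"

path = os.path.join(tempfile.mkdtemp(), "output.txt")
with io.open(path, "a", encoding="utf-8") as myfile:
    myfile.write(line)  # io.open encodes the text to UTF-8 on write

# Read the file back to confirm the round trip.
with io.open(path, "r", encoding="utf-8") as check:
    written = check.read()
```

In Python 2.7, bytes is an alias for str, so the isinstance check covers the database value, while unicode objects pass through untouched. io.open is used instead of the built-in open because it accepts an encoding argument and expects text on write.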
Errors
These are the errors I'm getting for the content pulled from the BeautifulSoup Summary() function.
C:\Python27\lib\site-packages\bs4\dammit.py:269: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
if (len(data) >= 4) and (data[:2] == b'\xfe\xff') \
C:\Python27\lib\site-packages\bs4\dammit.py:273: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
elif (len(data) >= 4) and (data[:2] == b'\xff\xfe') \
C:\Python27\lib\site-packages\bs4\dammit.py:277: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
elif data[:3] == b'\xef\xbb\xbf':
C:\Python27\lib\site-packages\bs4\dammit.py:280: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
elif data[:4] == b'\x00\x00\xfe\xff':
C:\Python27\lib\site-packages\bs4\dammit.py:283: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
elif data[:4] == b'\xff\xfe\x00\x00':
ValueError: Expected a bytes object, not a unicode object
The traceback portion is:
  File <myfile>, line 39, in <module>
    summary_soup = BeautifulSoup(page_summary)
  File "C:\Python27\lib\site-packages\bs4\__init__.py", line 193, in __init__
    self.builder.prepare_markup(markup, from_encoding)):
  File "C:\Python27\lib\site-packages\bs4\builder\_lxml.py", line 99, in prepare_markup
    for encoding in detector.encodings:
  File "C:\Python27\lib\site-packages\bs4\dammit.py", line 256, in encodings
    self.chardet_encoding = chardet_dammit(self.markup)
  File "C:\Python27\lib\site-packages\bs4\dammit.py", line 31, in chardet_dammit
    return chardet.detect(s)['encoding']
  File "C:\Python27\lib\site-packages\chardet\__init__.py", line 25, in detect
    raise ValueError('Expected a bytes object, not a unicode object')
ValueError: Expected a bytes object, not a unicode object