It makes no sense to convert text to unicode just in order to print it. Work with your data as unicode internally, and convert it to a concrete encoding only when you write it out.
What your code does instead: you're on Python 2, so your default string type (`str`) is a bytestring. In your statement you start with some UTF-8-encoded byte strings, convert them to unicode, and surround them with quotes (regular `str` objects that get coerced to unicode when combined into one string). You then pass this unicode string to `print`, which pushes it to `sys.stdout`. To do so, it has to turn the string back into bytes. If you are writing to the Windows console, Python can negotiate an encoding with the terminal, but if you redirect the output to a regular dumb file, it falls back on ASCII and complains, because there is no lossless way to encode arbitrary unicode as ASCII.
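You can reproduce that fallback directly. A minimal sketch (the sample string is my own, and the snippet is written so it runs under both Python 2 and Python 3.3+):

```python
# -*- coding: utf-8 -*-
# A unicode string containing one character outside ASCII (u-umlaut, u'\xfc').
text = u'unicode \xfcber alles!'

# This is effectively what print attempts when stdout falls back on ascii:
try:
    text.encode('ascii')
except UnicodeEncodeError as exc:
    # u'\xfc' has no ASCII representation, so the encode fails.
    print('encoding failed: %s' % exc)
```

This is the same `UnicodeEncodeError` you see when redirecting to a file; the console case merely hides it by picking a richer encoding.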
Solution: don't give `print` a unicode string. Encode it yourself to the representation of your choice:
print "Latin-1:", "unicode über alles!".decode('utf-8').encode('latin-1')
print "Utf-8:", "unicode über alles!".decode('utf-8').encode('utf-8')
print "Windows:", "unicode über alles!".decode('utf-8').encode('cp1252')
All of these should work without complaint when you redirect. The output probably won't look right on your screen, but open the output file with Notepad or similar and check whether your editor detects the format. (UTF-8 is the only one that has a hope of being auto-detected; cp1252 is a likely Windows default.)
Once you have that working, clean up your code and avoid using `print` for file output: use the `codecs` module, and open files with `codecs.open` instead of plain `open`.
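A short sketch of that approach (the file name `out.txt` and the sample string are just for illustration):

```python
# -*- coding: utf-8 -*-
import codecs

text = u'unicode \xfcber alles!'

# codecs.open returns a wrapped file object that encodes unicode on write,
# so there is no explicit .encode() call at each output site.
with codecs.open('out.txt', 'w', encoding='utf-8') as f:
    f.write(text + u'\n')

# Reading back through codecs.open decodes transparently as well.
with codecs.open('out.txt', 'r', encoding='utf-8') as f:
    assert f.read() == text + u'\n'
```

The encoding is then declared once, at the point where the file is opened, instead of being repeated at every write.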
PS. If you're decoding a UTF-8 string, the conversion to unicode should be lossless: you don't need the `errors='ignore'` flag. That flag is appropriate when you convert to ASCII, Latin-2, or some other limited codepage and want to silently drop the characters that don't exist in the target encoding.
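To make the distinction concrete, a sketch (the sample bytes are mine; the two bytes `\xc3\xbc` are the UTF-8 encoding of the u-umlaut):

```python
# -*- coding: utf-8 -*-
raw = b'unicode \xc3\xbcber alles!'  # valid UTF-8: \xc3\xbc is u-umlaut

# Decoding valid UTF-8 is lossless, so no errors flag is needed.
text = raw.decode('utf-8')
assert text == u'unicode \xfcber alles!'

# 'ignore' matters when the target encoding cannot represent a character:
# here the u-umlaut is silently dropped instead of raising an error.
assert text.encode('ascii', 'ignore') == b'unicode ber alles!'
```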