This is the code:

print '"' + title.decode('utf-8', errors='ignore') + '",' \
      ' "' + title.decode('utf-8', errors='ignore') + '", ' \
      '"' + desc.decode('utf-8', errors='ignore') + '")'

title and desc come from Beautiful Soup 3 (p[0].text and p[0].prettify) and, as far as I can tell from the Beautiful Soup 3 documentation, are UTF-8 encoded.

If I run

python.exe script.py > out.txt

I get the following error:

Traceback (most recent call last):
  File "script.py", line 70, in <module>
    '"' + desc.decode('utf-8', errors='ignore') + '")'
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf8' in position 264: ordinal not in range(128)

However if I run

python.exe script.py

I get no error. It happens only when the output is redirected to a file.

How can I get valid UTF-8 data in the output file?

Kaitnieks
  • You’re violating the *Don’t Repeat Yourself* principle by calling `decode` more than once. In fact, you shouldn’t be calling it at all. Just set the encoding on standard output and be done with it. The bug (Python’s, not yours) is that Python has this really annoying behavior in that it treats redirected output differently than it does unredirected output. – tchrist Apr 04 '12 at 20:15
  • Right now I'm not writing perfect code, I'm just trying various things that I can grasp from various tutorials until I figure out what works (voodoo coding, I believe) - then I'll make it neat and DRY. This is the first day I'm using Python and I'm not impressed so far. – Kaitnieks Apr 04 '12 at 20:20
  • Python doesn’t have a very good Unicode model, at least in Python2. You should be using Python3 if you can. What languages are you more used to? Have you considered simply setting your `PYTHONIOENCODING` environment variable to "utf8" and letting the chips fall where they may? – tchrist Apr 04 '12 at 20:22
  • You also shouldn't generally be using `errors='ignore'`, it hides errors in your code. – agf Apr 04 '12 at 20:23
  • Mostly Delphi, PHP, Javascript, but touched other as well. Normally I've seen 2 models to handle strings - either they are internally Unicode and decoded/encoded on input/output or they're byte representations internally of whatever was in input and converted only when it's necessary. Python seems to do both and according to other comments decoding can happen or not depending on various hidden things. I'm not yet out of options to try (thanks to SO) so I'm sure the solution will come. – Kaitnieks Apr 04 '12 at 20:28
  • Try using the environment variable I mentioned. If you have Unicode strings, they will be correctly encoded automatically for you. – tchrist Apr 04 '12 at 20:30
  • Just tried it but doesn't seem to help (UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2: ordinal not in range(128)) - even though I removed the decode calls. I know I must be doing something fundamentally wrong, some newbie error, but I don't know what yet. – Kaitnieks Apr 04 '12 at 20:37
  • Sounds like something is handing you UTF-8–encoded byte strings, not already-decoded abstract Unicode strings. I would try to keep only Unicode strings in your program, not byte strings, as they will eventually bite you. It seems like a waste to go back and forth like that; maybe you can get it to just give you regular strings? – tchrist Apr 04 '12 at 20:41
  • Is there a way to tell Python "just write the byte string as it is and don't try to do any conversion whatsoever"? Last thing I'm trying is foutout.write(desc) where foutout = open("out.txt", "wb") but _even_ that causes UnicodeEncodeError for some of the data. – Kaitnieks Apr 04 '12 at 20:50
  • @Kaitnieks: Yes, just `print` your byte string directly. However what you get from BeautifulSoup is typically a Unicode string not a byte string. When you `print` `title.decode()` you are implicitly encoding the Unicode string to bytes so that it can be decoded, then explicitly decoding to Unicode, then implicitly encoding back to bytes that can be printed! – bobince Apr 05 '12 at 10:56
  • I suggest `s= u'"%s", "%s", "%s"' % (title, title, desc)` then `print s.encode('utf-8')` if you are sure you always want UTF-8 bytes out. – bobince Apr 05 '12 at 11:01
  • Possible duplicate of [Setting the correct encoding when piping stdout in Python](http://stackoverflow.com/questions/492483/setting-the-correct-encoding-when-piping-stdout-in-python) – techraf Jan 10 '16 at 23:16
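
The `PYTHONIOENCODING` suggestion from the comments can be tried out with a small sketch like the one below. It is written for Python 3 (not the Python 2 of the question), and the helper name `run_with_io_encoding` is made up for illustration. It starts a child interpreter with its stdout piped, which is the same "not a console" situation as `> out.txt`, and captures the raw bytes.

```python
import os
import subprocess
import sys


def run_with_io_encoding(encoding):
    """Run a child Python that prints one non-ASCII character; return the raw stdout bytes."""
    # Copy the current environment and add PYTHONIOENCODING for the child only.
    env = dict(os.environ, PYTHONIOENCODING=encoding)
    result = subprocess.run(
        [sys.executable, "-c", "print(u'\\xf8')"],  # child prints U+00F8
        env=env,
        stdout=subprocess.PIPE,  # piped stdout: not a tty, like a redirect
        check=True,
    )
    return result.stdout


raw = run_with_io_encoding("utf-8")
```

With the variable set, the piped output should contain the UTF-8 bytes for the character rather than failing with an encoding error.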

4 Answers


You can use the codecs module to write unicode data to the file

import codecs
file = codecs.open("out.txt", "w", "utf-8")
file.write(something)

'print' writes to standard output, and if your console doesn't support UTF-8 it can cause this kind of error even if you pipe stdout to a file.
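
As a rough, self-contained illustration of the file-based approach: the sketch below swaps in `io.open` for `codecs.open` because it takes an `encoding` argument and works the same way on Python 2.6+ and Python 3. The helper `write_utf8`, the temporary path, and the sample text are placeholders, not part of the original code.

```python
import io
import os
import tempfile


def write_utf8(path, text):
    """Write a Unicode string to path, encoding it as UTF-8 on the way out."""
    with io.open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


# Hypothetical output location and sample data, for illustration only.
out_path = os.path.join(tempfile.mkdtemp(), "out.txt")
write_utf8(out_path, u'"S\xf8ren", "desc"')

# Read the file back as bytes to see what actually landed on disk.
with io.open(out_path, "rb") as handle:
    raw = handle.read()
```

The key point is that the text passed to `write` must already be a Unicode string; the file object does the encoding.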

Maksym Polshcha
  • Is there any codec that would output byte strings as they are without trying to convert them, like, 'raw' or something? – Kaitnieks Apr 04 '12 at 20:54
  • @Kaitnieks: here is the list of all supported encodings http://docs.python.org/library/codecs.html#standard-encodings – Maksym Polshcha Apr 04 '12 at 21:00
  • This actually worked, once I got my strings to Unicode. I had to (sadly) abandon using .prettify() because it returned string instead of Unicode string for this. Thanks. – Kaitnieks Apr 04 '12 at 22:13
  • Actually, encoding your strings to utf-8 and writing to console might display weird, but it will not cause errors, even if you redirect output to a file. It's only when you try to write out raw unicode that you'll trigger python's automatic conversion, which will fail on lossy conversion to ascii. – alexis Apr 05 '12 at 11:10

Windows behaviour in this case is a bit complicated. You should follow the other advice here: use unicode strings internally and decode on input.

To answer your question: when stdout is redirected you need to print encoded byte strings (only you know which encoding you want), but for plain screen output you have to print unicode strings (and Python or the Windows console handles the conversion to the proper encoding).

I recommend structuring your script this way:

# -*- coding: utf-8 -*- 
import sys, codecs
# set up output encoding
if not sys.stdout.isatty():
    # here you can set encoding for your 'out.txt' file
    sys.stdout = codecs.getwriter('utf8')(sys.stdout)

# next, you will print all strings in unicode
print u"Unicode string ěščřžý"

Update: see also this similar question: Setting the correct encoding when piping stdout in Python
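
To see what the `codecs.getwriter` wrapper does without actually redirecting anything, here is a small sketch for Python 3. An `io.BytesIO` buffer stands in for the redirected, byte-oriented stdout; that stand-in and the variable names are assumptions made for the example. On Python 3 the real byte stream would be `sys.stdout.buffer` rather than `sys.stdout` itself.

```python
import codecs
import io

# Stand-in for a redirected stdout that only accepts bytes.
fake_stdout = io.BytesIO()

# getwriter returns a StreamWriter class; instantiating it wraps the byte stream.
writer = codecs.getwriter("utf-8")(fake_stdout)

# Unicode text goes in, UTF-8 bytes come out on the underlying stream.
writer.write(u"Unicode string \u011b\u0161\u010d\u0159\u017e\xfd\n")

encoded = fake_stdout.getvalue()
```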

Jiri

It makes no sense to convert text to unicode in order to print it. Work with your data in unicode, convert it to some encoding for output.

What your code does instead: You're on Python 2, so your default string type (str) is a bytestring. In your statement you start with some UTF-8-encoded byte strings, convert them to unicode, and surround them with quotes (regular str values that are coerced to unicode in order to combine into one string). You then pass this unicode string to print, which pushes it to sys.stdout. To do so, it needs to turn it into bytes. If you are writing to the Windows console, it can pick up an encoding from the console, but if you redirect to a regular dumb file, it falls back on ascii and complains because there's no lossless way to do that.
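
The failing step can be reproduced in isolation. This is a minimal sketch (runnable on Python 3, sample text invented for the example) of what the implicit ASCII fallback amounts to: the same Unicode string encodes fine as UTF-8 but cannot be encoded as ASCII.

```python
text = u"S\xf8ren"  # contains U+00F8, the character from the traceback

try:
    # This is effectively what happens when no output encoding is known.
    text.encode("ascii")
    ascii_failed = False
    failure = ""
except UnicodeEncodeError as exc:
    ascii_failed = True
    failure = str(exc)

# Choosing an encoding that can represent the character works.
utf8_bytes = text.encode("utf-8")
```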

Solution: Don't give print a unicode string. "encode" it yourself to the representation of your choice:

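# -*- coding: utf-8 -*-
# (Python 2 needs the declaration above, and the file saved as UTF-8, for the literals below)
print "Latin-1:", "unicode über alles!".decode('utf-8').encode('latin-1')
print "Utf-8:", "unicode über alles!".decode('utf-8').encode('utf-8')
print "Windows:", "unicode über alles!".decode('utf-8').encode('cp1252')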

All of this should work without complaint when you redirect. It probably won't look right on your screen, but open the output file with Notepad or another editor and check whether the editor detects the encoding. (UTF-8 is the only one that has a hope of being detected; cp1252 is a likely Windows default.)

Once you get that down, clean up your code and avoid using print for file output. Use the codecs module, and open files with codecs.open instead of plain open.

PS. If you're decoding a utf-8 string, conversion to unicode should be loss-less: you don't need the errors=ignore flag. That's appropriate when you convert to ascii or Latin-2 or whatever, and you want to just drop characters that don't exist in the target codepage.
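
A short sketch of that point, runnable on Python 3 with a made-up sample string: decoding valid UTF-8 round-trips exactly, while `errors='ignore'` only changes anything when the target encoding cannot represent a character, in which case the character is silently dropped.

```python
utf8_bytes = b"unicode \xc3\xbcber alles!"  # UTF-8 bytes for the sample text

# Strict decoding (the default) succeeds because the bytes are valid UTF-8.
text = utf8_bytes.decode("utf-8")

# Encoding back to UTF-8 is lossless.
round_trip = text.encode("utf-8")

# Latin-1 can represent the umlaut, so nothing is lost there either.
latin1 = text.encode("latin-1")

# ASCII cannot; with errors='ignore' the character just disappears.
ascii_lossy = text.encode("ascii", errors="ignore")
```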

alexis
  • Wow, that’s just awful. You never have to do *anything* like that in [insert many other languages]. You really expect people to call two functions calls for every single output statement? What a disaster! You are so so so violating *Don’t Repeat Yourself*. You should just be able to set the encoding on the output and forget about it. – tchrist Apr 04 '12 at 21:57
  • You don't have to, actually. The OP just made a mess of his unicode handling. With a bit of understanding of what's going on, the conversions can be limited to what's necessary. And in python 3 it's conceptually clearer what's going on. And if you want to set the encoding on sys.stdout, you can, but that's a different issue. – alexis Apr 05 '12 at 11:03
  • That’s what I was thinking, but I wasn’t sure where he’d gone wrong. I really only ever work in Python3, because I find the Unicode handling in Python2 too tedious. – tchrist Apr 05 '12 at 13:15
  • Basically, it makes no sense to convert text to unicode in order to print it. If you have multi-language text, "decode" it to unicode on input, do all the processing in unicode, and encode again (to utf-8 or something else) for writing out. – alexis Apr 05 '12 at 15:02

Problem: if you run on Windows:

python.exe script.py

The following will be in effect:

sys.stdout.encoding: utf-8
sys.stdout.isatty(): True

But, if you run:

python.exe script.py > out.txt

you will effectively have this:

sys.stdout.encoding: cp1252
sys.stdout.isatty(): False

So, a possible solution (in Python 3.7 or later, where sys.stdout.reconfigure is available):

import sys
if not sys.stdout.isatty():
    # output is redirected: force UTF-8 instead of the locale default
    sys.stdout.reconfigure(encoding='utf-8')

# In Python 3, print is a function and title/desc are assumed to be str
# (already Unicode), so no decode calls are needed.
print('"' + title + '", "' + title + '", "' + desc + '")')

See Also: How to set sys.stdout encoding in Python 3?
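
The effect of `reconfigure` can be checked without touching the real stdout. In this sketch (Python 3.7+), an `io.TextIOWrapper` around an in-memory buffer plays the role of a redirected `sys.stdout` that started out as cp1252; the buffer and variable names are stand-ins chosen for the example.

```python
import io

# Byte sink standing in for the redirected output file.
raw = io.BytesIO()

# Text layer that starts with the cp1252 encoding, as in the redirected case above.
stream = io.TextIOWrapper(raw, encoding="cp1252")

# Switch the text layer to UTF-8 before writing anything.
stream.reconfigure(encoding="utf-8")

# U+011B is not representable in cp1252, but is fine in UTF-8.
stream.write(u"\u011b\xf8")
stream.flush()

encoded = raw.getvalue()
```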

SergeF