I'm having difficulty parsing data that contains a lot of scientific and international symbols using Python 2.7, so I wrote a toy program that illustrates what is not making sense to me:

#!/usr/bin/python
# coding=utf-8
str ="35 μg/m3"
str = str.decode('utf-8') 
str = str.encode('utf-8') #ready for printing? 
print(str)

And instead of printing out the original content, I get something different:

[Image: screen copy of the console output]

Serge Ballesta
Anthony
  • Shouldn't you encode first and then decode/directly print it? – be_good_do_good Jul 31 '16 at 20:56
  • You're going to want to use Python3 if you're dealing with unicode, unless you can't. Or just like pain. – Wayne Werner Jul 31 '16 at 20:57
  • Works fine on Python 2.7.9... maybe try `# -*- coding: latin-1 -*-` ... – l'L'l Jul 31 '16 at 20:59
  • It didn't work on my 2.7.12. I'll probably go with Wayne's suggestion. – Anthony Jul 31 '16 at 21:01
  • obviously your console is not set to understand UTF-8. – Antti Haapala -- Слава Україні Jul 31 '16 at 21:03
  • Your input is *already* UTF8 - at least that's what you tell Python with that `encoding` line. Does your text editor confirm the source file is encoded as UTF8 as well? Does the line print okay without those decode/encode steps? – Jongware Jul 31 '16 at 21:03
  • `# -*- coding: utf-8 -*-` and `print (u"35 μg/m3".encode("utf-8")).decode("utf-8")` So special chars to `unicode` after `encode` , if save `decode` as `utf-8` – dsgdfg Jul 31 '16 at 21:06
  • It's clearly something to do with Windows and PowerShell. I just ran the program successfully on my Linux box. I'll leave the question open in case someone knows the specific PowerShell or Windows quirk responsible for this. I think Antti Haapala is correct, though I didn't see any settings in the PowerShell properties to set the encoding. – Anthony Jul 31 '16 at 21:14
  • Possible duplicate of [Unicode characters in Windows command line - how?](http://stackoverflow.com/questions/388490/unicode-characters-in-windows-command-line-how) - See in the command prompt properties, it tells you what the codepage is (Latin-1), change it with `chcp`, use Lucida Console font, save your Python file as a UTF-8 encoded file, then print the string directly without encoding or decoding - http://i.imgur.com/hL7pz78.png – TessellatingHeckler Jul 31 '16 at 22:11
  • `chcp` is for CMD. Don't use it in PowerShell. If for some reason you need to modify the output encoding in PowerShell use [`$OutputEncoding`/`[Console]::OutputEncoding`](https://blogs.msdn.microsoft.com/powershell/2006/12/11/outputencoding-to-the-rescue/). I don't think it's required in this case, though. – Ansgar Wiechers Jul 31 '16 at 23:12

3 Answers

0

The line # coding=utf-8 only matters for unicode literals and is of no use for plain byte strings. Assuming your Python file really is UTF-8 encoded, the line str = str.decode('utf-8') gives you a correct unicode string.

But as Ansgar Wiechers said, since you declare your encoding, the simpler way is to use a unicode literal directly:

str = u"35 μg/m3"

The underlying problem is that the Windows console has poor support for UTF-8. Its common encodings are win1252 (a latin1 variant) or cp850, a native OEM code page. Unless you want to handle the console encoding explicitly, your best bet is to print the unicode string directly:

#!/usr/bin/python
# coding=utf-8
str ="35 μg/m3"
str = str.decode('utf-8') # str is now a unicode string
print(str)

If you want to use latin1 explicitly, and provided you use a TrueType font such as Lucida Console or Consolas, you can do:

chcp 1252
python .\encoding.py

with

#!/usr/bin/python
# coding=utf-8
str ="35 µg/m3" # must be the micro sign U+00B5: latin1 has no Greek mu (U+03BC)
str = str.decode('utf-8') # str is now a unicode string
str = str.encode('latin1') # str is now a latin1 encoded byte string
print(str)
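
Whether the latin1 route works depends on which "mu" the string actually contains. The sketch below is written for Python 3 (an assumption, since the question targets 2.7; there `bytes` plays the role of the 2.7 `str` and `str` the role of `unicode`). The helper `to_latin1` and the constant names are hypothetical, introduced only for illustration.

```python
# Python 3 sketch: the two "mu" characters are different code points.
MICRO_SIGN = "\u00b5"  # present in latin1 / cp1252
GREEK_MU = "\u03bc"    # not present in latin1


def to_latin1(text):
    """Map Greek mu to the micro sign, then encode to latin1 (hypothetical helper)."""
    return text.replace(GREEK_MU, MICRO_SIGN).encode("latin-1")


# Encoding the Greek letter directly fails, because latin1 has no slot for it.
try:
    ("35 " + GREEK_MU + "g/m3").encode("latin-1")
    greek_mu_encodes = True
except UnicodeEncodeError:
    greek_mu_encodes = False

latin1_bytes = to_latin1("35 " + GREEK_MU + "g/m3")

print(greek_mu_encodes)
print(latin1_bytes)
```

If the data may contain either character, mapping one onto the other before encoding, as the helper does, is one way to keep the output stable.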
Serge Ballesta
  • I think you're confusing PowerShell with CMD here. – Ansgar Wiechers Jul 31 '16 at 22:35
  • @AnsgarWiechers: AFAIK, OP executes PowerShell in a console. His problem is only related to console encoding and PowerShell does honour the `chcp` command. So yes, this answer is valid in a cmd.exe shell, and no, I am not confusing the two, because it is also valid in PowerShell. – Serge Ballesta Aug 01 '16 at 07:27
  • [`$OutputEncoding`](https://blogs.msdn.microsoft.com/powershell/2006/12/11/outputencoding-to-the-rescue/) is a better way to deal with output encoding issues in PowerShell, regardless of how/where PowerShell was started. – Ansgar Wiechers Aug 01 '16 at 08:43
  • @AnsgarWiechers: OP's question is related to Python output in a PowerShell. Python 2.7 does not use PowerShell specific tools, so `$OutputEncoding` is no use here. But I agree with you, `[Console]::OutputEncoding` could be used here too. – Serge Ballesta Aug 01 '16 at 09:01
0

Python 2.7 doesn't use Unicode strings by default, so you basically have 2 options:

  • Define the string as a Unicode string literal (u"..."):

    # coding=utf-8
    str = u"35 µg/m3"
    print(str)
    

    This way you can simply use the string as one would expect, so I'd prefer this approach.

  • Define the string as a regular string literal and decode it:

    # coding=utf-8
    str = "35 \xc2\xb5g/m3"
    print(str.decode('utf-8'))
    

    With this approach you can write special characters as hexadecimal escapes (µ in UTF-8 is the byte sequence 0xC2,0xB5), which keeps the bytes the same however the file is saved. In a file that is both saved and declared as UTF-8, a literal µ produces the same two bytes.
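
The relationship between the escaped bytes and the micro sign can be checked in a few lines. This is a minimal sketch written for Python 3 (an assumption; in 2.7 the byte literal would be a plain `str` and the text a `unicode`), and the variable names are illustrative only.

```python
# Python 3 sketch: escaped bytes and an encoded micro sign are the same data.
escaped = b"35 \xc2\xb5g/m3"                # bytes written with hex escapes
encoded = "35 \u00b5g/m3".encode("utf-8")   # micro sign encoded as UTF-8

same_bytes = (escaped == encoded)
same_text = (escaped.decode("utf-8") == "35 \u00b5g/m3")

print(same_bytes)
print(same_text)
```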

Demonstration:

PS C:\> $PSVersionTable.PSVersion.ToString()
4.0
PS C:\> C:\Python27\python.exe -V
Python 2.7.11
PS C:\> Get-Content .\test.py -Encoding UTF8
# coding=utf-8
str1 = "35 \xc2\xb5g/m3"
print(str1)
print(str1.decode('utf-8'))
str2 = u"35 µg/m3"
print(str2)
PS C:\> C:\Python27\python.exe .\test.py
35 ┬Ág/m3
35 µg/m3
35 µg/m3
Ansgar Wiechers
0

Your decoding/encoding has no effect:

# coding=utf-8
s1 = "35 μg/m3"
s2 = s1.decode('utf-8') 
s3 = s2.encode('utf-8') #ready for printing?
print s1 == s3  # True

If your source is UTF-8 as declared, then s1 is a byte string that is UTF-8-encoded already. Decoding it to a Unicode string (s2) and re-encoding it as UTF-8 just gives you the original byte string.
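
The round trip can be verified directly. The sketch below is written for Python 3 (an assumption, since the question targets 2.7), where `bytes` corresponds to the 2.7 `str` and `str` to `unicode`; the names mirror the snippet above.

```python
# Python 3 sketch: decode followed by encode returns the original bytes.
s1 = "35 \u03bcg/m3".encode("utf-8")  # the UTF-8 bytes as stored in the source file
s2 = s1.decode("utf-8")               # text (unicode in 2.7)
s3 = s2.encode("utf-8")               # bytes again

round_trip_is_identity = (s1 == s3)

print(round_trip_is_identity)
print(len(s2), len(s1))  # number of characters vs. number of bytes
```

The text is one unit shorter than the byte string because the Greek mu occupies two bytes in UTF-8.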

Next, the Windows console does not default to UTF-8, so printing those bytes will interpret them in the console encoding, which on my system is:

import sys
print sys.stdout.encoding
print s3

Output:

cp437
35 ┬╡g/m3
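
The garbled output can be reproduced without a Windows console by decoding the UTF-8 bytes with an OEM code page, which is roughly what the console does when it displays them. This is a sketch for Python 3 (an assumption; the thread uses 2.7), and the comparison strings are written with escapes so the example does not depend on the encoding of the terminal running it.

```python
# Python 3 sketch: UTF-8 bytes misread as an OEM code page.
utf8_bytes = "35 \u00b5g/m3".encode("utf-8")  # micro sign -> b'\xc2\xb5'

as_cp437 = utf8_bytes.decode("cp437")  # 0xC2 -> U+252C, 0xB5 -> U+2561
as_cp850 = utf8_bytes.decode("cp850")  # 0xC2 -> U+252C, 0xB5 -> U+00C1

# Compare against the two box-drawing / accented results seen in this thread.
print(as_cp437 == "35 \u252c\u2561g/m3")
print(as_cp850 == "35 \u252c\u00c1g/m3")
```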

The correct way to print Unicode strings and have them interpreted correctly is to actually print Unicode strings. Python will encode them to the console encoding and they will display correctly (assuming the console font and encoding support the characters).

# coding=utf-8
s = u"35 µg/m3"
print s

Output:

35 µg/m3
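
When the console encoding may not support every character, one option is to encode explicitly with a replacement policy rather than let the print fail. The helper below is hypothetical and written for Python 3 (an assumption; the thread uses 2.7). It relies only on `sys.stdout.encoding` and the standard `errors="replace"` argument of `str.encode`.

```python
import sys


def encode_for_console(text, encoding=None):
    """Encode text for a console, replacing unsupported characters with '?'.

    Hypothetical helper: uses the given encoding, else sys.stdout.encoding,
    else falls back to ASCII when no encoding is known.
    """
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    return text.encode(encoding, errors="replace")


sample = "35 \u00b5g/m3"
print(encode_for_console(sample, "cp437"))  # micro sign exists in cp437
print(encode_for_console(sample, "ascii"))  # micro sign is replaced
```

Replacement loses information, so this suits display only; data written to files is better kept in UTF-8.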
Mark Tolonen