
Can someone explain to me why Python behaves this way?

Let me explain.

BACKGROUND

I have a Python installation and I want to use some characters that aren't in the ASCII table, so I changed Python's default encoding. I save every string in a .py file like this: '_MAIL_TITLE_': u'Бронирование номеров',

Now, with a method that replaces my dictionary keys, I want to insert my strings into an HTML template dynamically.

I place this in the HTML page's header:

<head>
 <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
 ...... <!-- Some Css's --> 
</head>

Unfortunately, my HTML document comes back to me (after those replacements) with some wrong characters (unconverted? misconverted?).

So, I open a terminal and start to put things in order:

 1 - Python 2.4.6 (#1, Jan 27 2012, 15:41:03)
 2 - [GCC 4.1.2 20080704 (Red Hat 4.1.2-51)] on linux2
 3 - Type "help", "copyright", "credits" or "license" for more information.
 4 - >>> import sys
 5 - >>> sys.getdefaultencoding()
 6 - 'utf-8'
 7 - >>> u'èéòç'
 8 - u'\xe8\xe9\xf2\xe7'
 9 - >>> u'èéòç'.encode('utf-8')
10 - '\xc3\xa8\xc3\xa9\xc3\xb2\xc3\xa7'
11 - >>> u'è'
12 - u'\xe8'
13 - >>> u'è'.encode()
14 - '\xc3\xa8'
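For reference, the two byte sequences in the session above can be reproduced like this (a Python 3 sketch; the original session is Python 2, where the code points on line 8 happen to coincide with the Latin-1 byte values):

```python
# The same four characters, encoded two ways (Python 3).
s = 'èéòç'
# Latin-1 bytes: one byte per character, equal to the code points on line 8.
assert s.encode('latin-1') == b'\xe8\xe9\xf2\xe7'
# UTF-8 bytes: two bytes per character, as shown on line 10.
assert s.encode('utf-8') == b'\xc3\xa8\xc3\xa9\xc3\xb2\xc3\xa7'
```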

QUESTION

Take a look at lines 7-10. Isn't that weird? If my Python has a default encoding of UTF-8 (line 6), why does it convert that string (line 7) in a different way than line 9 does? Now, take a look at lines 11-14 and their output.

Now, I'm totally confused!

THE HINT

So, I've tried to change my terminal's input encoding (previously ISO-8859-1, now UTF-8) and something changed:

 1 - Python 2.4.6 (#1, Jan 27 2012, 15:41:03)
 2 - [GCC 4.1.2 20080704 (Red Hat 4.1.2-51)] on linux2
 3 - Type "help", "copyright", "credits" or "license" for more information.
 4 - >>> import sys
 5 - >>> sys.getdefaultencoding()
 6 - 'utf-8'
 7 - >>> u'èéòç'
 8 - u'\xc3\xc3\xa8\xc3\xa9\xc3\xb2\xc3\xa7'
 9 - >>> u'èéòç'.encode('utf-8')
10 - '\xc3\xa8\xc3\xa9\xc3\xb2\xc3\xa7'
11 - >>> u'è'
12 - u'\xe8'
13 - >>> u'è'.encode()
14 - '\xc3\xa8'

So, explicit encoding works independently of the input encoding (or so it seems to me, but I've been stuck on this for days, so maybe I've messed up my mind).
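That strange line 8 can be reproduced with a small sketch (Python 3 syntax; the session above is Python 2): the terminal now sends UTF-8 bytes, but they are read as individual Latin-1 characters, and encoding that mojibake string doubles the damage.

```python
# UTF-8 bytes misread as Latin-1 characters (a Python 3 sketch).
wire = 'èéòç'.encode('utf-8')       # the 8 bytes a UTF-8 terminal transmits
mojibake = wire.decode('latin-1')   # taken as 8 separate Latin-1 characters
assert len(mojibake) == 8           # eight characters instead of four
# Explicitly encoding *that* string encodes the mojibake, not the original:
assert len(mojibake.encode('utf-8')) == 16
```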

WHERE IS THE SOLUTION??

By looking at line 8 of BACKGROUND and of THE HINT, you can see that different unicode objects are created. So, I started to think about it. What have I concluded? Nothing. Nothing except that, maybe, my encoding problems lie in the file encoding used when I save my .py (which contains all the UTF-8 strings that have to be inserted into the HTML document).

THE "REAL" CODE

The code does nothing special: it opens an HTML template, reads it into a string, replaces placeholders with unicode (UTF-8-encoded? I hope so) strings, and saves the result into another file that will be viewed on the Internet (yes, my "landing" page has the UTF-8 declaration in its header). I don't have the code here because it is scattered across several files, but I'm sure of the program's workflow (I traced it).

FINAL QUESTION

In light of this, does anybody have any idea how to make my code work? Ideas about Unix file encoding? Or .py file encoding? How can I change the encoding to make my code work?

LAST HINT

Before substituting placeholders with UTF-8 objects, if I insert a

utf8Obj.encode('latin-1')

my document displays perfectly on the Internet!

Thanks to those who answer.

EDIT1 - DEVELOPMENT WORKFLOW

OK, here is my development workflow:

I have a CVS repository for this project. The project is located on a CentOS server, a 64-bit machine. I develop my code on Windows 7 (64-bit) with Eclipse. Every modification is committed ONLY with CVS commit. The code is executed on the CentOS machine, which uses this Python:

Python 2.4.6 (#1, Jan 27 2012, 15:41:03)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-51)] on linux2

I set Eclipse to work this way: PREFERENCES -> GENERAL -> WORKSPACE -> TEXT FILE ENCODING: UTF-8

A Zope/Plone application runs on the same server: it serves some PHP pages. The PHP pages call some Python methods (application logic) via web services located on the Zope/Plone "server". That server interfaces directly with the application logic.

That's all.

EDIT2

This is the function that does the replace:

    def _fillTemplate(self, buf):
        """_fillTemplate(buf) --> str
        Returns the document with the placeholders replaced from dict_template.
        """
        for k, v in self.dict_template.iteritems():
            if not isinstance(v, unicode):
                v = str(v)
            else:
                v = v.encode('latin-1')  # This way it works, but why?
            buf = buf.replace(k, v)
        return buf
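For comparison, a hypothetical rework (Python 3 syntax; `fill_template` and the sample mapping are illustrative, not the original code) that keeps everything as text and encodes exactly once, at the output boundary:

```python
def fill_template(buf, mapping):
    """Return buf with every placeholder key replaced by its value."""
    for key, value in mapping.items():
        buf = buf.replace(key, str(value))
    return buf

html = fill_template('<title>_MAIL_TITLE_</title>',
                     {'_MAIL_TITLE_': 'Бронирование номеров'})
assert html == '<title>Бронирование номеров</title>'
# Encode exactly once, when writing the output file:
# open(path, 'w', encoding='utf-8').write(html)
```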
DonCallisto
  • As usual, the first thing to check is whether the encoding chain is broken somewhere. What editor do you use? Is it set to save files as UTF-8 or Latin-1? What is the encoding of your HTML page (not the header, the encoding with which it is saved)? What is the encoding of the Python code file? Do you set the Python file encoding header to utf8? – Bite code Mar 09 '12 at 09:28
  • @e-satis : I use Eclipse (and CVS for committing my code) and it is set to work in UTF-8. My HTML page is written with Eclipse as well. What the encoding of the Python code file is, is a mystery to me. How can I retrieve that? The last answer is good for the last question as well – DonCallisto Mar 09 '12 at 09:35
  • Did you write the Python code file? If yes, you must know the encoding, since you wrote it. And it's probably utf8 if you wrote it with Eclipse. If you didn't, then you can't use it without knowing the encoding. – Bite code Mar 09 '12 at 09:42
  • I write my own code, obviously. But the weird thing is that if I change the encoding of the file (from Eclipse) and want to commit it back into the repository, CVS tells me there aren't any differences between REP.FILE and LOC.FILE. Moreover, how can you explain the "last hint" section? – DonCallisto Mar 09 '12 at 09:45
  • I don't understand "my terminal way of input files". What does it mean? What was ISO-8859-1, previously? – Bite code Mar 09 '12 at 09:48
  • I use PuTTY as an agent to connect from my Windows environment to a remote shell on my CentOS servers. So PuTTY has - as you say in your answer - its own way to encode input. On the first try, it was set to ISO-8859-1. But this isn't strictly related to my HTML document problem. It's just information that drove me to ask "what encoding do my .py files have?". I have an idea: does a function corresponding to PHP's bin2hex exist? That way, I can show you the output of my unicode chars directly by opening my .py – DonCallisto Mar 09 '12 at 09:52
  • You mean you don't have a server setup on your machine? You can't fire up a Python interpreter doing just that on your local computer? Where do you write the code then, on Windows? Which version? Which encoding? And no, it's not "just an information that drive me to 'what encode have [your] .py files'". It's an additional layer, so additional complexity. Every time you add a layer with encoding, you add another opportunity for bugs. Please ADD ALL YOUR WORKFLOW IN THE OP. – Bite code Mar 09 '12 at 10:01
  • Take it easy, PHP devs. You're rocking the website. –  Mar 09 '12 at 14:05
  • @DonCallisto: e-satis has obviously taken time and effort to help you. He certainly deserves more courtesy than what can be found in your comment. To go back to the matter at hand: I would advise that you avoid "magic", when programming; understanding what you do will first require some serious time investment, but will save time in the long run. – Eric O. Lebigot Mar 10 '12 at 02:35
  • @EOL : I wasn't the first to spend some words out of place... I know what it means to try to explain something to someone who, at first, doesn't understand you or the whole complex system underlying it. But I suppose this community was created for this; rep. is only a "gift" that someone gives you. The target is to help users, not to help them for rep, imho. And of course, yes, I know that "magic" isn't the way, but I'm not the kind of developer that, once a problem is solved, turns off the computer and goes away.... Judging from behind a keyboard and monitor is wrong. – DonCallisto Mar 10 '12 at 09:11

3 Answers


In order to solve this and future problems, I would advise that you look at the answers to question UnicodeDecodeError when redirecting to file, which contains a general discussion of what this encoding/decoding business is about.


In the first example, your terminal encodes in Latin1:

7 - >>> u'èéòç'
8 - u'\xe8\xe9\xf2\xe7'

Your terminal sends the Latin-1 bytes \xe8\xe9\xf2\xe7, and these byte values happen to coincide with the Unicode code points of the same characters, so the resulting unicode string is correct even though no real conversion happened. When you switch your terminal to UTF-8, you get

7 - >>> u'èéòç'
8 - u'\xc3\xc3\xa8\xc3\xa9\xc3\xb2\xc3\xa7'

Your terminal now sends UTF-8 encodings to Python, as four 2-byte sequences. Your Python interpreter took these bytes verbatim, one character per byte, so the resulting unicode string is mojibake: eight characters instead of your original four.


If your editor saves UTF-8, then you should put the following on top of your .py file:

# -*- coding: utf-8 -*-

This line must match the encoding used by your editor.
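A sketch of why the declaration must match (shown with Python 3 byte/text types; the bytes themselves are the same in Python 2):

```python
# A source line as saved by a UTF-8 editor.
src = "title = 'è'\n".encode('utf-8')
# Decoding with the declared (correct) encoding recovers the text:
assert src.decode('utf-8') == "title = 'è'\n"
# Decoding the same bytes as Latin-1 silently corrupts the literal:
assert src.decode('latin-1') != src.decode('utf-8')
```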


The most robust approach to handling encodings is probably one of the following two:

  1. Your program should only manipulate internally (byte) strings in a single encoding (UTF-8 is a good choice). This means that if you get, say, Latin-1-encoded data, you should re-encode it into UTF-8:

    data.decode('latin1').encode('utf8')
    

    The best way of handling your string literals, in this case, is to have your editor save your file in UTF-8 and use the regular (byte) string literals ("This is a string", with no u in front).

  2. Your program can alternatively only manipulate Unicode strings. My experience is that this is a little cumbersome, with Python 2. This would be my method of choice with Python 3, though, because Python 3 has much more natural support for these encoding issues (literal strings are character strings, not byte strings, etc.).
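Point 1's normalization step can be sketched as follows (Python 3 syntax; `incoming` is an assumed example of Latin-1 input):

```python
incoming = b'\xe8\xe9'                          # Latin-1-encoded input data
# Decode with the *source* encoding, re-encode with the internal one:
normalized = incoming.decode('latin-1').encode('utf-8')
assert normalized == b'\xc3\xa8\xc3\xa9'        # same text, now UTF-8
```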

Eric O. Lebigot
  • This is an interesting point. But, how about the rest of the question? – DonCallisto Mar 09 '12 at 09:27
  • If I put that line, everything is okay. But if I want to extract my string from a WS (xmlrpc protocol) that way, everything gets messed up. If I don't include that directive, my WS requests are satisfied in the right way. So, to me, it's a dog that eats its own tail... – DonCallisto Mar 09 '12 at 09:31
  • If you give us the relevant lines of code, or a small program that reproduces your problem, then I can probably help more. – Eric O. Lebigot Mar 09 '12 at 09:31
  • Don't -1 this. It's part of the answer. An answer we can't give you completely, because encoding problems come with a lot of variables, most of them being PBCK. – Bite code Mar 09 '12 at 09:32
  • @DonCallisto: The `coding` directive only lists the encoding used by your *editor*. It has nothing to do with data that you manipulate. As for your other question about WS, I will add some comments in the answer. – Eric O. Lebigot Mar 09 '12 at 09:34
  • @EOL : but if I add that directive, my WS methods will no longer return data in the correct way. I use Zope/Plone services for WS, so the code only returns a dictionary (by return statement). No xmlrpc libs are involved (since, I suppose, Zope/Plone does it automatically) – DonCallisto Mar 09 '12 at 09:38
  • In your question, you said there is nothing special with the code, just HTML. Now you are talking about xmlrpc. This is much more complicated. You must add the complete workflow to the question. Plus, setting the encoding header is not adding a problem, it is revealing a problem. Force yourself to add the encoding header everywhere. It should work with it, provided that your editor does save the file in this encoding. This way we will have a base to work from. – Bite code Mar 09 '12 at 09:46
  • @e-satis : maybe I didn't explain well. xmlrpc (which is another face of the same question) isn't strictly related to my HTML document. I have two "separated" parts that "feed" themselves from the same file. Now, if I add the directive that EOL suggests, my HTML document (which is what I'm asking about) works fine. The other part (which receives labels from WS calls and is a PHP page, so the replace isn't local, but on the server) breaks instead. In the opposite scenario, my PHP works perfectly but the HTML document (local) doesn't – DonCallisto Mar 09 '12 at 09:58
  • @EOL : I'll try to save in UTF-8 format with my editor and remove all the 'u's. For the moment, thank you for the support and help. – DonCallisto Mar 09 '12 at 10:51
  • @EOL : I've printed the hex of the strings that go from the application logic (Python) to Zope and PHP (both after removing the 'u' and saving in UTF-8 format, and keeping the strings with 'u'), and they're the same in both cases.... So why, when I specify u'str', does PHP interpret it in the right way, but when I strip it off, the string isn't interpreted as UTF-8? I'm a little bit confused. – DonCallisto Mar 09 '12 at 14:55
  • @DonCallisto: I would suggest that you read and ponder over my answer in the StackOverflow question I linked to; I would advise that you then spend time checking each and every step of your encoding chain, in your program. Just step back, relax, and think thoroughly about how each part of your program handles these encoding issues. This is not easy, and this takes time, but knowing precisely what you do at every step will greatly help. – Eric O. Lebigot Mar 10 '12 at 02:12
  • @EOL : yes, thanks, but my answer below is the solution. I changed the PHP xmlrpc encoding method and all is good. – DonCallisto Mar 10 '12 at 09:06
  • @DonCallisto: I am glad that you found a solution. However, not understanding why a solution works leaves the door open to unforeseen bugs–I am not saying that your solution is wrong, but only that I can't know whether it is right without more information on your encoding chain. – Eric O. Lebigot Mar 10 '12 at 10:15
  • @EOL : yes, it's perfectly clear to me that it is difficult for you without the complete view of the chain. But the problem was between the PHP xmlrpc lib and Zope. To understand that, I made my own xmlrpc client that calls the same function called by PHP. With this client, the data were good. So xmlrpc for PHP comes with ISO-8859-1 as the default header of the xmlrpc message, and that messed everything up. Changing it to UTF-8 is the way. Obviously, I had to strip all the `u` from the code. And obviously, with the `u` it used to work because, as I said, I suspect that unicode objects in Python "like" Latin-1 encoding – DonCallisto Mar 10 '12 at 10:23

While you answer my comment, here is the answer to the first question:

Take a look at lines 7-10. Isn't that weird? If my Python has a default encoding of UTF-8 (line 6), why does it convert that string (line 7) in a different way than line 9 does? Now, take a look at lines 11-14 and their output.

No, it's not weird: you must distinguish between the Python encoding, shell encoding, system encoding, file encoding, declared file encoding and applied encoding. That makes a lot of encodings, doesn't it?

sys.getdefaultencoding()

This gives you the encoding Python uses for its unicode implementation. It has nothing to do with output.

In [7]: u'è'
Out[7]: u'\xe8'
In [8]: u'è'.encode('utf8')
Out[8]: '\xc3\xa8'
In [9]: print u'è'
è
In [10]: print u'è'.encode('utf8')
è

When you use print, the character is printed to the screen; if you don't, Python gives you a representation that you can copy/paste to obtain the same data.

Since a unicode string is not the same as a utf8 string, it doesn't give you the same data.

Unicode is a "neutral" representation of the string, while utf8 is an encoded one.
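The same repr-versus-encoded distinction can be sketched in Python 3, where the text/byte split is explicit:

```python
s = 'è'
# repr() gives the copy/paste-able representation of the text string:
assert repr(s) == "'è'"
# Encoding produces different data: a byte string in a concrete encoding.
assert s.encode('utf-8') == b'\xc3\xa8'
```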

Bite code

In Line 7 you output a Unicode object:

>>> u'èéòç'
u'\xe8\xe9\xf2\xe7'

No encoding happens, it just tells you that your input consists of the Unicode code units \xe8, \xe9, and so on.

In line 9 you create a UTF-8 encoded string from the Unicode object. The output of the encoded string looks different from the unencoded Unicode object, but why wouldn't it:

>>> u'èéòç'.encode('utf-8')
'\xc3\xa8\xc3\xa9\xc3\xb2\xc3\xa7'

In your second experiment, where you changed the terminal encoding, you actually broke the interpretation of input characters:

>>> u'èéòç'
u'\xc3\xa8\xc3\xa9\xc3\xb2\xc3\xa7'

When you typed those four characters into the string, the terminal encoded them as eight UTF-8 bytes, and Python took each byte as a separate character. But those bytes don't represent the characters you wanted to type in. It looks like Python thinks it gets ISO-8859-1 characters from the terminal while it actually gets UTF-8 data, resulting in a mess.
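This mix-up can be reproduced directly (a Python 3 sketch), and it is reversible as long as no bytes have been lost:

```python
typed = 'èéòç'
wire = typed.encode('utf-8')        # the UTF-8 terminal sends 8 bytes
garbled = wire.decode('latin-1')    # Python reads them as 8 Latin-1 chars
assert garbled == 'Ã¨Ã©Ã²Ã§'
# Undo: re-encode as Latin-1 to get the original bytes back, then decode
# them correctly as UTF-8:
assert garbled.encode('latin-1').decode('utf-8') == typed
```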

sth
  • Interesting. So even in the "line 7" case? – DonCallisto Mar 09 '12 at 10:34
  • @sth: The representation u'\xe8\xe9\xf2\xe7' generally contains code *units*, not code *points*. A code point is the number associated with a character; code points are unrelated to the notion of encoding. A code unit is the byte representation of a character, and is what encoding is about. I fixed this in a couple of places in your answer. Reference: http://en.wikipedia.org/wiki/Code_unit. – Eric O. Lebigot Mar 10 '12 at 02:26