35

There is a trend of discouraging the use of sys.setdefaultencoding('utf-8') in Python 2. Can anybody list real examples of problems caused by it? Arguments like "it is harmful" or "it hides bugs" don't sound very convincing.

UPDATE: Please note that this question is only about utf-8; it is not about changing the default encoding "in the general case".

Please give some examples with code if you can.

smci
  • 32,567
  • 20
  • 113
  • 146
anatoly techtonik
  • 19,847
  • 9
  • 124
  • 140
  • how would you be using it? If you are talking about modifying sitecustomize.py then when the code is run on other computers you may well have bugs – Padraic Cunningham Feb 22 '15 at 11:42
  • If you have a decode or encode error it is probably for an obvious reason i.e `s = u'é' str(s)` . You should work with one type either string or unicode and handle the encoding explicitly. – Padraic Cunningham Feb 22 '15 at 11:57
  • @PadraicCunningham, http://stackoverflow.com/questions/28642781/hack-jinja2-to-encode-from-utf-8-instead-of-ascii, no global settings - application-only. – anatoly techtonik Feb 22 '15 at 12:06
  • 4
    might be relevant https://mail.python.org/pipermail/python-dev/2009-August/091406.html *You can get strange effects caused by the fact that some string objects will now compare equal while not necessarily having the same hash value. Unicode objects and strings have the same hash value provided that they are both ASCII. With the ASCII default encoding, a non-ASCII string cannot be compared to a Unicode object, so the problem does not occur.* – Padraic Cunningham Feb 22 '15 at 13:05
  • @PadraicCunningham, `UTF-8` string is a not a Unicode object yet, and regardless of the encoding such string objects won't compare equal if they have different contents. Unless there is a bug in Python hash function, – anatoly techtonik Feb 25 '15 at 11:52
  • 5
    Because you are misunderstanding how Python works with encodings if you think you need it. Here’s a presentation of how to use it **correctly**: http://farmdev.com/talks/unicode/ – As an aside, if the argument “it hides bugs” doesn’t sound convincing to you, *that* may be the real problem. (And yes, Unicode in Python 2 sucks. But `sys.setdefaultencoding` isn’t the solution.) And lastly, if you want to see a bug it causes, look no further: http://stackoverflow.com/a/28627705/1968 – Konrad Rudolph Feb 25 '15 at 15:55
  • @KonradRudolph, that's why I am asking for a real example that I can understand. – anatoly techtonik Feb 26 '15 at 08:42
  • @techtonik here's [an example of a question](http://stackoverflow.com/questions/25250857/unsuppress-unicodeencodeerror-exceptions-when-run-from-aptana-studio-pydev) where a user got screwed because the Author of PyDev thinks it's a good idea to set `sys.setdefaultencoding('utf-8')`. Here's a [blog post](https://opensourcehacker.com/2010/01/24/aptana-studio-eclipse-pydev-default-unicode-encoding/) of someone else that got screwed by this with some more details and further links. – Lukas Graf Apr 12 '15 at 07:38
  • 3
    A nice posting today on the topic: https://anonbadger.wordpress.com/2015/06/16/why-sys-setdefaultencoding-will-break-code/ – Mark Tolonen Jun 17 '15 at 06:00

5 Answers

25

The original poster asked for code which demonstrates that the switch is harmful - other than by "hiding" bugs unrelated to the switch.

Updates

  • [2020-11-01]: pip install setdefaultencoding
Eradicates the need to reload(sys) (from Thomas Grainger); see the usage sketch right after this list.

  • [2019]: Personal experience with python3:

    • No unicode en/decoding problems. Reasons:
    • Got used to writing .encode('utf-8') and .decode('utf-8') what felt like a hundred times a day.
    • Looking into libraries: same there. 'utf-8' is either hardcoded or the silent default in pretty much all the I/O done.
    • Heavily improved byte string support finally made it possible to port I/O-centric applications like Mercurial.
    • Having to write .encode and .decode all the time made people aware of the difference between strings for humans and strings for machines.
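
For illustration, a minimal Python 2 sketch of both variants mentioned in the updates above. That the package's function takes the target encoding as an argument is my assumption; Thomas Grainger's comment below only shows the import and the call:

# classic way: site.py deletes sys.setdefaultencoding at startup,
# so it has to be brought back via reload(sys)
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

# with the setdefaultencoding package no reload(sys) is needed
# (that it takes the encoding as an argument is an assumption)
import setdefaultencoding
setdefaultencoding.setdefaultencoding('utf-8')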

In my opinion, Python 2's byte strings, combined with (UTF-8 default) decoding only before output to humans or unicode-only formats, would have been the technically superior approach, compared to decoding and encoding everything at ingress and egress many, many times without actual need. It depends on the application whether a function like len() is more practical when it returns the character count for humans, or when it returns the bytes used to store and forward by machines.
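
A minimal Py2 sketch of that len() distinction (assuming the source file is saved as UTF-8):

# -*- coding: utf-8 -*-
s = 'é'       # byte string: two UTF-8 bytes in memory
u = u'é'      # unicode string: one code point

print len(s)  # -> 2: bytes to store and forward, for machines
print len(u)  # -> 1: character count, for humans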

=> I think it's safe to say that UTF-8 everywhere saved the Unicode Sandwich Design.
Without it, the many libraries and applications which only pass strings through without interpreting them could not work.

Summary of conclusions

(from 2017)

Based on both experience and evidence I've collected, here are the conclusions I've arrived at.

  1. Setting the defaultencoding to UTF-8 nowadays is safe, except for specialised applications handling files from non-Unicode-ready systems.

  2. The "official" rejection of the switch is based on reasons no longer relevant for a vast majority of end users (not library providers), so we should stop discouraging users to set it.

  3. Working in a model that handles Unicode properly by default is far better suited for applications doing inter-system communication than manually working with the unicode APIs.

Effectively, modifying the default encoding avoids a number of user headaches in the vast majority of use cases. Yes, there are situations in which programs dealing with multiple encodings will silently misbehave, but since this switch can be enabled piecemeal, this is not a problem in end-user code.

More importantly, enabling this flag is a real advantage in users' code, both by reducing the overhead of manually handling Unicode conversions, which clutters the code and makes it less readable, and by avoiding potential bugs when the programmer fails to handle the conversions properly in all cases.


Since these claims are pretty much the exact opposite of Python's official line of communication, I think an explanation for these conclusions is warranted.

Examples of successfully using a modified defaultencoding in the wild

  1. Dave Malcolm of Fedora believed it is always right. He proposed, after investigating the risks, to change the def.enc. to UTF-8 distribution-wide for all Fedora users.

    The only hard fact presented for why Python would break is the hashing behaviour listed above - which is never picked up by any other opponent within the core community as a reason to worry about, nor even by the same person when working on user tickets.

    Outcome at Fedora: Admittedly, the change itself was described as "wildly unpopular" with the core developers, and it was accused of being inconsistent with previous versions.

  2. There are 3000 projects alone at openhub doing it. They have a slow search frontend, but scanning over it, I estimate 98% are using UTF-8. I found nothing about nasty surprises.

  3. There are 18000(!) github master branches with it changed.

    While the change is "unpopular" with the core community, it's pretty popular in the user base. Though this could be disregarded, since users are known to use hacky solutions, I don't think it's a relevant counterargument, due to my next point.

  4. There are only about 150 bug reports total on GitHub due to this. Set against the numbers above, that is an effectively 100% success rate - the change seems to be positive, not negative.

    To summarize the existing issues people have run into, I've scanned through all of the aforementioned tickets.

    • Changing def.enc. to UTF-8 is typically introduced, but not removed, in the issue-closing process, most often as the solution. Some bigger projects excuse it as a temporary fix, given the "bad press" it has, but far more bug reporters are just glad about the fix.

    • A few (1-5?) projects modified their code to do the type conversions manually, so that they no longer needed to change the default.

    • In two instances someone claims that setting def.enc. to UTF-8 leads to a complete lack of output, without explaining the test setup. I could not verify the claim; I tested one case and found the opposite to be true.

    • One claims his "system" might depend on not changing it, but we do not learn why.

    • One (and only one) had a real reason to avoid it: ipython either uses a 3rd party module or the test runner modified their process in an uncontrolled way (it is never disputed that a def.enc. change is advocated by its proponents only at interpreter setup time, i.e. when 'owning' the process).

  5. I found zero indication that the different hashes of 'é' and u'é' cause problems in real-world code.

  6. Python does not "break"

    After changing the setting to UTF-8, no feature of Python covered by unit tests works any differently than without the switch. The switch itself, though, is not tested at all.

  7. It is advised on bugs.python.org to frustrated users

    Examples here, here or here (often connected with the official line of warning)

    The first one demonstrates how established the switch is in Asia (compare also with the github argument).

  8. Ian Bicking published his support for always enabling this behavior.

    I can make my systems and communications consistently UTF-8, things will just get better. I really don't see a downside. But why does Python make it SO DAMN HARD [...] I feel like someone decided they were smarter than me, but I'm not sure I believe them.

  9. Martijn Faassen, while refuting Ian, admitted that ASCII might have been wrong in the first place.

    I believe if, say, Python 2.5, shipped with a default encoding of UTF-8, it wouldn't actually break anything. But if I did it for my Python, I'd have problems soon as I gave my code to someone else.

  10. In Python3, they don't "practice what they preach"

    While opposing any def.enc. change so harshly because of environment-dependent code or implicitness, a discussion here revolves around Python3's problems with its 'unicode sandwich' paradigm and the corresponding required implicit assumptions.

    Furthermore, they created possibilities to write valid Python3 code like:

     >>> from 褐褑褒褓褔褕褖褗褘 import *        
     >>> def 空手(合氣道): あいき(ど(合氣道))
     >>> 空手(う힑힜(' ') + 흾)
     
    
  11. DiveIntoPython recommends it.

  12. In this thread, Guido himself advises a professional end user to use a process-specific environment with the switch set, to "create a custom Python environment for each project."

    The fundamental reason the designers of Python's 2.x standard library don't want you to be able to set the default encoding in your app, is that the standard library is written with the assumption that the default encoding is fixed, and no guarantees about the correct workings of the standard library can be made when you change it. There are no tests for this situation. Nobody knows what will fail when. And you (or worse, your users) will come back to us with complaints if the standard library suddenly starts doing things you didn't expect.

  13. Jython offers to change it on the fly, even in modules.

  14. PyPy did not support reload(sys) - but brought it back on user request within a single day, no questions asked. Compare that with the "you are doing it wrong" attitude of CPython, claiming without proof that it is the "root of evil".


Ending this list, I confirm that one could construct a module which crashes because of a changed interpreter config, by doing something like this:

def is_clean_ascii(s):
    """ [Stupid] type-agnostic check whether s contains only ASCII chars """
    try:
        unicode(str(s))  # decodes via the default codec (normally ASCII)
        # we end up here, returning True, also for NON-ASCII input
        # if the def.enc. was changed to UTF-8
        return True
    except Exception:
        return False

if is_clean_ascii(mystr):
    <code relying on mystr to be ASCII>

I don't think this is a valid argument, because the person who wrote this dual-type-accepting module was obviously aware of ASCII vs. non-ASCII strings and would also be aware of encoding and decoding.

I think this evidence is more than enough indication that changing this setting does not lead to any problems in real world codebases the vast majority of the time.

Red Pill
  • 511
  • 6
  • 15
  • 4
    Shouldn't this be a blog entry that you link to in a comment on Martijn's answer? – Wayne Conrad Apr 23 '15 at 20:40
  • thanks for the feedback, I provide now a summary of my investigations on top. – Red Pill Apr 24 '15 at 08:23
  • 7
    This answer is really far too long, and unnecessarily so. Most of your supporting arguments, the ones that take up the bulk of your post, appear to be nothing more an [argumentum ad populum](https://en.wikipedia.org/wiki/Argumentum_ad_populum) at best, and a [proof by verbosity](https://en.wikipedia.org/wiki/Proof_by_intimidation) at worst. Furthermore, the entire section about standardization and encoding is irrelevant and belongs in a blog post, not in an answer on Stack Overflow. Your answer would be much better if you simply distilled the *technical reasons* for your opinion, nothing more. – Alexis King Apr 24 '15 at 08:55
  • 2
    Some specific comments: Setting a different default is like using `goto`. Sure, you can make it work, but you'll have a harder time for it as you develop the application. You get to be inconsistent in your handling of Unicode and that is going to bite you. Most people that use it do *not* understand Unicode and think this is the easy way out. – Martijn Pieters Apr 25 '15 at 08:12
  • 2
    Arguments that a lot of GitHub code uses it is not proof that it is okay to use, it can also be taken as proof most developers do not understand how to use Unicode properly. You see the same issues with [how inexperienced developers use `super()`](https://stackoverflow.com/questions/19608134/why-is-python-3-xs-super-magic/19609168#19609168). Generally speaking, it is a [*Cargo Cult*](http://en.wikipedia.org/wiki/Cargo_cult), applied and misapplied without understanding *how it works* or if it is needed at all. – Martijn Pieters Apr 25 '15 at 08:17
  • 1
    You are right, a default should, quite generally, never be changed, just because problems go away magically and you don't know why. You _should_ know what u r doing. But IF you know what it does then Python2 is just way better to work with. Better than Py3 for me - but thats a different story ;-) – Red Pill Apr 25 '15 at 19:48
  • I also begin to understand that your main problem with it seems to be the (agreed) fact that your code could get _inconsistent_ regarding string types traveling through, some unicode some byte, while without the switch it would crash. Also here I'm with you: One should decide before writing the first Py2 l.o.c., if his lib or process should be working with unicode OR with bytes - consistently. We prefer bytes - with good reasons. – Red Pill Apr 25 '15 at 19:58
  • @MartijnPieters "You get to be inconsistent in your handling of Unicode and that is going to bite you. " Could you elaborate on _what_ exactly will be the problems biting us? So `setdefaultencoding` seems to be rather a safe way out. If something would break big time, wouldn't we have heard of it by now, and wouldn't that mean _that thing_ which breaks on using another default encoding needs to be fixed? Thanks for your insight. IMO the way Python 2.x continues to refuse to handle ASCII > 127 by default is rather arcane (though I'm all in favor of Python otherwise)... – miraculixx Jan 03 '16 at 11:42
  • 1
    @miraculixx: Python 2.0 was the first Python version to introduce Unicode support, in October 2000. It included the decision there and then to disable setting the default encoding. That means there is now *15 years* of legacy code out there that relies on being able to catch an exception when you try to concatenate non-ASCII bytes to bytes that are not decodable as ASCII, etc. You cannot possibly fix all that code. – Martijn Pieters Jan 04 '16 at 10:11
  • 1
    @miraculixx: and what you call 'arcane' is called *backwards and forwards compatibility*, a requirement when your language is used by billions of computers in the world. Python 3 could make the switch, because it did not make any promises about compatibility. – Martijn Pieters Jan 04 '16 at 10:12
  • > That means there is now 15 years of legacy code out there that relies on being able to catch an exception (...). Actually, the 15-years of legacy code relies on the standard lib to work with unicode (i.e. `sometext'.decode('whatever')`, and not supporting changing the defaultenconding IMHO is akin of saying we're not sure whether unicode support actually works [in the stdlib]. Anyway I get your point. Essentially it means switching defaultencoding is not officially supported, however as this answers points out under some circumstances there are advantages of doing so. Thanks for your POV. – miraculixx Jan 04 '16 at 17:13
  • 1
    Having this knowledge earlier we would have never needed Python 3, sick of wasting a decade of Python's community's time causing lack of innovation – nehem Nov 17 '17 at 02:55
  • @nehemiah: That pretty much sums up my original post into one line. – Red Pill Jan 13 '18 at 12:58
  • 1
    Thanks very much for the analysis. I had a program using modules that used str() in a way that caused the UnicodeDecodeError exception, and no easy way to fix them. Using the def.enc. solution was the only way to tame "bugs" in the modules. I used the reload/setdefaultencoding only under very controlled circumstances (to contain possible side effects) and have had no problems. Your post helped to alleviate concerns about the side effects, so was helpful to make me more comfortable with my solution. – Tim Bird Jul 16 '20 at 18:53
  • 1
    so you want to call sys.setdefaultencoding, but don't want to reload(sys)? introducing `pip install setdefaultencoding` ! >>> import setdefaultencoding >>> setdefaultencoding.setdefaultencoding – Thomas Grainger Jul 27 '20 at 16:32
  • Thanks @ThomasGrainger - hope you don't mind that I mention this one in the OP. – Red Pill Oct 30 '20 at 07:30
16

Because you don't always want to have your strings automatically decoded to Unicode, or for that matter your Unicode objects automatically encoded to bytes. Since you are asking for a concrete example, here is one:

Take a WSGI web application; you are building a response by adding the product of an external process to a list, in a loop, and that external process gives you UTF-8 encoded bytes:

results = []
content_length = 0

for somevar in some_iterable:
    output = some_process_that_produces_utf8(somevar)
    content_length += len(output)
    results.append(output)

headers = {
    'Content-Length': str(content_length),
    'Content-Type': 'text/html; charset=utf8',
}
start_response(200, headers)
return results

That's great and fine and works. But then your co-worker comes along and adds a new feature; you are now providing labels too, and these are localised:

results = []
content_length = 0

for somevar in some_iterable:
    label = translations.get_label(somevar)
    output = some_process_that_produces_utf8(somevar)

    content_length += len(label) + len(output) + 1
    results.append(label + '\n')
    results.append(output)

headers = {
    'Content-Length': str(content_length),
    'Content-Type': 'text/html; charset=utf8',
}
start_response(200, headers)
return results

You tested this in English and everything still works, great!

However, the translations.get_label() library actually returns Unicode values and when you switch locale, the labels contain non-ASCII characters.

The WSGI library writes out those results to the socket, and all the Unicode values get auto-encoded for you, since you set setdefaultencoding() to UTF-8 - but the length you calculated is entirely wrong. It'll be too short, as UTF-8 encodes everything outside of the ASCII range with more than one byte.
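
A minimal sketch of just that miscount (the label value is made up):

# -*- coding: utf-8 -*-
label = u'Étiquette'             # unicode label from the translations layer
content_length = len(label)      # 9 code points counted...
body = label.encode('utf-8')     # ...but 10 bytes actually written out
print content_length, len(body)  # -> 9 10: the Content-Length is too short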

All this is ignoring the possibility that you are actually working with data in a different codec; you could be writing out Latin-1 + Unicode, and now you have an incorrect length header and a mix of data encodings.

Had you not used sys.setdefaultencoding(), an exception would have been raised and you would have known you had a bug. But now your clients are complaining about incomplete responses; there are bytes missing at the end of the page and you don't quite know how that happened.

Note that this scenario doesn't even involve 3rd party libraries that may or may not depend on the default still being ASCII. The sys.setdefaultencoding() setting is global, applying to all code running in the interpreter. How sure are you there are no issues in those libraries involving implicit encoding or decoding?

That Python 2 encodes and decodes between str and unicode types implicitly can be helpful and safe when you are dealing with ASCII data only. But you really need to know when you are mixing Unicode and byte string data accidentally, rather than plastering over it with a global brush and hoping for the best.
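
To make the "exception would have been raised" point concrete, a minimal sketch with made-up values, run under the default ASCII codec:

# -*- coding: utf-8 -*-
label = u'étiquette'        # unicode from a translations library
output = 'caf\xc3\xa9'      # UTF-8 encoded bytes from an external process

try:
    body = label + '\n' + output   # triggers an implicit ASCII decode
except UnicodeDecodeError as exc:
    print 'bug surfaced early:', exc
# with sys.setdefaultencoding('utf-8'), the same concatenation silently
# produces a unicode object and the bug travels on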

Martijn Pieters
  • 1,048,767
  • 296
  • 4,058
  • 3,343
  • There is a mistake in `you don't always want to have your strings automatically decoded to Unicode` - the strings are decoded to UTF-8, not to Unicode objects. – anatoly techtonik Apr 10 '15 at 13:33
  • 1
    @techtonik: UTF-8 is an *encoding*, so they'd be encoded to UTF-8. That's the issue though, you get Unicode objects when you mix the two types; `str + unicode` gives you `unicode`, provided the `str` could be decoded. – Martijn Pieters Apr 10 '15 at 13:34
  • 1
    @techtonik: in my sample the `translations.get_label()` returns `unicode` objects. The WSGI implementation could also opt to just concatenate all the results, at which point you'd get one `unicode` object as output passed on to the socket, or perhaps to another WSGI wrapping label. We won't know, because we silenced all Python exceptions that normally would have been thrown. – Martijn Pieters Apr 10 '15 at 13:37
  • I don't get it. To me it is like you are saying that with `sys.setdefaultencoding("utf-8")` Python will start producing `unicode` objects in places where it was `str` previously. Is that right? (I am still reading through the example) – anatoly techtonik Apr 10 '15 at 13:40
  • A table about type conversion and contents of variable will definitely help to get that right. – anatoly techtonik Apr 10 '15 at 13:41
  • Python will try and decode `str` objects when concatenating with `unicode` objects, yes, and that will normally fail if those bytes are not decodable as ASCII. But as soon as you change the default codec, then bytes that are decodable as UTF-8 will also be converted and you do end up with Unicode objects where you thought you were producing byte values instead. – Martijn Pieters Apr 10 '15 at 13:41
  • So, the Python will not crash with non-ASCII strings anymore with `sys.setdefaultencoding("utf-8")`. I fail see how that this behaviour is bad for your example. In case of my application (Roundup) this is close to the crash I am trying to fix - http://stackoverflow.com/questions/28642781/hack-jinja2-to-encode-from-utf-8-instead-of-ascii – anatoly techtonik Apr 10 '15 at 13:44
  • 4
    @techtonik: we are going round in circles. You don't see this as bad, because you don't see how implicitly converting types can be bad. In a language where implicit conversions are the exception rather than the default, this is a **huge** issue, and you are changing the rules of that conversion at a global level. If this was configured per module instead, you'd be free to shoot yourself in the foot without also forcing the issue for any 3rd party library you may be using. But that's not the case here, and if you are not seeing a problem with such behaviour I don't know what to tell you. – Martijn Pieters Apr 10 '15 at 13:47
  • I see that things **can** be bad, but I don't see that there is a real world example of that changed behaviour was **desired** behaviour. In your example, the app will just crash on international symbol, which is happened in http://stackoverflow.com/questions/28642781/hack-jinja2-to-encode-from-utf-8-instead-of-ascii when we added Unicode templating layer to Roundup, and `sys.setdefaultencoding("utf-8")` is the only recommended **way to fix that crash**. What I am hearing from you is that the crash is desired behaviour. I can not agree on that, sorry. – anatoly techtonik Apr 10 '15 at 13:51
  • 2
    `the length you calculated is entirely wrong` is a good argument though. http://pastebin.ubuntu.com/10791721/ gives 3 and 6 on console. But this looks like a bug in Python, which is unable to handle mutibyte encodings. – anatoly techtonik Apr 10 '15 at 13:54
  • 1
    @techtonik: the desired behaviour would be to *fix Roundup*. If there is a bug in a 3rd party product, and the only work-around is to make a global change, then there is something wrong with that product. – Martijn Pieters Apr 10 '15 at 14:00
  • 6
    @techtonik: Why is that a bug in how Python handles a multibyte encoding? The length of a Unicode string should be the number of codepoints, not the number of bytes in an arbitrary codec. The length of a byte string should be the number of bytes. The Content Length header should contain the byte count, not the codepoint count. I don't see why this is a multi-byte vs. single-byte encoding issue. – Martijn Pieters Apr 10 '15 at 14:03
  • @techtonik: in your pastie are getting the length of byte strings, encoded to UTF-8. You get the same output without the `sys.setdefaultencoding()` call. – Martijn Pieters Apr 10 '15 at 14:07
  • Ok. So if we are not using `len()` for string processing, we are basically save to use `sys.setdefaultencoding("utf-8")` (which seems to be the case with Roundup core which seems to merely move utf-8 strings content from DB to the template layer). – anatoly techtonik Apr 12 '15 at 06:44
  • The problem with external libs will only appear if they use non-English chars themselves (badlib), or being fed `utf-8` string for processing. Which leads to question http://stackoverflow.com/questions/29586776/trace-functions-that-are-called-on-python-strings - how to trace that utf-8 strings are passed to external libs. – anatoly techtonik Apr 12 '15 at 06:46
  • The mentioned issue with Roundup is http://issues.roundup-tracker.org/issue2550811 - I'd like to know how'd you propose to fix it. – anatoly techtonik Apr 12 '15 at 06:47
  • 4
    @techtonik: using Jinja2 here reveals that Roundup is not practicing the *Unicode sandwich* approach; make *all text* in the application `unicode` at the point of entry as early as possible, and only encode to bytes at the point of exit, as late as possible. In this context, I recommend reading / seeing Ned Batchelder's [Pragmatic Unicode presentation](http://nedbatchelder.com/text/unipain.html). – Martijn Pieters Apr 12 '15 at 09:26
  • To be more precise "but the **byte** length you calculated is entirely wrong". Assuming that the number of bytes in a string is equal to the number of characters is generally a bad idea, but was safe if str is ascii. Trying to write code in py2 with unicode_literals and be unicode everywhere, it seems like changing the default encoding would be great -- but I guess my real problem is I introduced a `str` somewhere. Thanks for the enlightening explanation. – idbrii Dec 19 '15 at 15:41
3

First of all: Many opponents of changing the default encoding argue that it is dumb because it even changes ASCII comparisons.

I think it is fair to make clear that, in line with the original question, I see nobody advocating anything other than deviating from ASCII to UTF-8.

The setdefaultencoding('utf-16') example seems always to be brought forward only by those who oppose changing it ;-)


With m = {'a': 1, 'é': 2} and the file 'out.py':

# coding: utf-8
print u'é' 

Then:

+---------------+-----------------------+-----------------+
| DEF.ENC       | OPERATION             | RESULT (printed)|            
+---------------+-----------------------+-----------------+
| ANY           | u'abc' == 'abc'       | True            |     
| (i.e.Ascii    | str(u'abc')           | 'abc'           |
|  or UTF-8)    | '%s %s' % ('a', u'a') | u'a a'          | 
|               | python out.py         | é               |
|               | u'a' in m             | True            |
|               | len(u'a'), len('a')   | (1, 1)          |
|               | len(u'é'), len('é')   | (1, 2) [*]      |
|               | u'é' in m             | False  (!)      |
+---------------+-----------------------+-----------------+
| UTF-8         | u'abé' == 'abé'       | True   [*]      |
|               | str(u'é')             | 'é'             |
|               | '%s %s' % ('é', u'é') | u'é é'          | 
|               | python out.py | more  | 'é'             |
+---------------+-----------------------+-----------------+
| Ascii         | u'abé' == 'abé'       | False, Warning  |
|               | str(u'é')             | Encoding Crash  |
|               | '%s %s' % ('é', u'é') | Decoding Crash  |
|               | python out.py | more  | Encoding Crash  |
+---------------+-----------------------+-----------------+

[*]: Result assumes the same é. See below on that.

Looking at those operations, changing the default encoding in your program might not look too bad; it gives you results 'closer' to having ASCII-only data.

Regarding the hashing (in) and len() behaviour, you get the same results as with ASCII (more on those results below). Those operations also show that there are significant differences between unicode and byte strings - which might cause logical errors if ignored.

As noted already: it is a process-wide option, so you have just one shot to choose it - which is the reason why library developers should really never ever do it, but instead get their internals in order so that they do not need to rely on Python's implicit conversions. They also need to clearly document what they expect and return, and to deny input they did not write the lib for (like the normalize function, see below).

=> Writing programs with that setting on makes it risky for others to use the modules of your program in their code, at least without filtering input.

Note: Some opponents claim that def.enc. is even a system-wide option (via sitecustomize.py), but at the latest in times of software containerisation (Docker), every process can be started in its own perfect environment without overhead.


Regarding the hashing and len() behaviour:

It tells you that even with a modified def.enc. you still can't be ignorant about the types of strings you process in your program. u'' and '' are different sequences of bytes in memory - not always, but in general.

So when testing, make sure your program behaves correctly also with non-ASCII data.

Some say the fact that strings can compare equal via '==' (due to implicit conversions) while not having the same hash values is an argument against changing def.enc.

I personally don't share that concern, since the hashing behaviour remains just the same as without the change. I have yet to see a convincing example of undesired behaviour due to that setting in a process I 'own'.
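
For concreteness, a minimal sketch of the phenomenon in question (assumes def.enc. has been switched to UTF-8 and the source file is saved as UTF-8):

# -*- coding: utf-8 -*-
m = {'é': 2}                   # key is a UTF-8 byte string, as in the table

print 'é' == u'é'              # -> True: implicit UTF-8 decode before comparing
print hash('é') == hash(u'é')  # -> False: hashes match only for ASCII
print u'é' in m                # -> False (!) despite the equality above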

All in all, regarding setdefaultencoding("utf-8"): the answer as to whether it's dumb or not should be more balanced.

It depends. While it does avoid crashes, e.g. at str() operations in a log statement, the price is a higher chance of unexpected results later, since wrong types make it further into code whose correct functioning depends on a certain type.

In no case should it be an alternative to learning the difference between byte strings and unicode strings for your own code.


Lastly, setting the default encoding away from ASCII does not make your life any easier for common text operations like len(), slicing and comparisons - should you assume that (byte)stringifying everything with UTF-8 resolves all problems here.

Unfortunately it doesn't - in general.

The '==' and len() results are a far more complex problem than one might think - even with the same type on both sides.

Without def.enc. changed, '==' always fails for non-ASCII, as shown in the table. With it changed, it works - sometimes:

Unicode standardised around a million symbols of the world and gave each a number - but unfortunately there is no 1:1 correspondence between the glyphs displayed to a user on output devices and the symbols they are generated from.

To motivate you to research this: having two files, j1 and j2, written with the same program using the same encoding, both containing user input:

>>> u1, u2 = open('j1').read(), open('j2').read()
>>> print sys.version.split()[0], u1, u2, u1 == u2

Result: 2.7.9 José José False (!)

Using print as a function in Py2, you see the reason: unfortunately, there are TWO ways to encode the same character, the accented 'e':

>>> print (sys.version.split()[0], u1, u2, u1 == u2)
('2.7.9', 'Jos\xc3\xa9', 'Jose\xcc\x81', False)

What a stupid codec, you might say, but it's not the fault of the codec. It's a problem in Unicode as such.

So even in Py3:

>>> u1, u2 = open('j1').read(), open('j2').read()
>>> print(sys.version.split()[0], u1, u2, u1 == u2)

Result: 3.4.2 José José False (!)

=> Independent of Py2 and Py3, actually independent of any computing language you use: to write quality software, you probably have to "normalise" all user input. The Unicode standard standardised normalisation. In Python 2 and 3, the unicodedata.normalize function is your friend.
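
A minimal sketch of that normalisation step (Py2 syntax; the two spellings are the ones from the j1/j2 example above):

import unicodedata

u1 = u'Jos\xe9'       # 'José' with the precomposed e-acute code point
u2 = u'Jose\u0301'    # 'José' as plain 'e' plus a combining acute accent

print u1 == u2        # -> False
print unicodedata.normalize('NFC', u1) == unicodedata.normalize('NFC', u2)
# -> True: NFC collapses both to the precomposed form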

Red Pill
  • 511
  • 6
  • 15
  • 1
    You are assuming your source code is encoded to UTF-8 as well. Or that **all** your byte strings are UTF-8 encoded. Implicit encoding from Unicode to UTF-8, then concatenating that data with any other byte string using an arbitrary encoding would be a huge bug, and you masked it by setting the default encoding. – Martijn Pieters Apr 10 '15 at 11:26
  • 1
    Another issue is that code can *rely* on encoding or decoding errors to signal type differences. That includes 3rd party libraries. By setting a default encoding other than ASCII, you can no longer detect UTF-8 bytes -> Unicode and Unicode -> bytes implicit encodings where you meant to actually use explicit encodings. – Martijn Pieters Apr 10 '15 at 11:27
  • 1
    In any case, I've yet to come across a use-case where setting the default encoding was a better idea than handling encodings correctly. It's like using globals, you don't use them because *in practice* you significantly increase the *likelyhood* of bugs. – Martijn Pieters Apr 10 '15 at 11:29
  • 1
    So if testing ensures that your code works correctly with non-ASCII data, why not *go the extra step* and handle encoding and decoding correctly, and not mix types arbitrarily? Why rely on the `setdefaultencoding()` crutch at all? – Martijn Pieters Apr 10 '15 at 11:30
  • 1
    On the whole, I am not actually sure where you are going with this answer; yes, Unicode comparisons have their issues, but you are not actually saying anything clear about why `sys.setdefaultencoding()` should be avoided. – Martijn Pieters Apr 10 '15 at 11:38
  • 1
    thats right - the goal of my post was to make clear that 1. the answer to this question should be more balanced. 2. def.enc = utf-8 does not relief the developer of understanding byte and unicode string differences - for his own code 3. quality text processing is far more complex than novices might think even for the atomic operations like len() and comparisons. – Red Pill Apr 13 '15 at 14:26
  • 1
    Categorically refusing 1. is in my view neglecting the problems people have out there especially with tons of legacy code - I dare to claim that much Py2 code out there was written by people driven by solving a specific problem outside of text processing - with tons of str() operations inside... Further, pretty fashionable languages like go and rust these days prove that its possible to work in a 'utf-8 byte string sandwich' and use unicode functions only when needed, intermediately. – Red Pill Apr 13 '15 at 14:27
  • Python is of course not go or rust :-) I can see that there are legacy projects but that doesn't mean that when they get to unicode handling they should just set a global configuration that can have unintended consequences. Ferreting out the subtle bugs this can introduce are going to take just as much work as gating those sections and just decode your bytes to unicode objects at those points. That's at least the approach Plone is taking, for example. – Martijn Pieters Apr 13 '15 at 16:18
  • 1
    IMHO that's the best answer so far as it clearly shows the alternatives and consequences, as opposed to the _dangerland!_ arguments. Thank you. – miraculixx Jan 03 '16 at 19:32
2

Real-world example #1

It doesn't work in unit tests.

The test runner (nose, py.test, ...) initializes sys first, and only then discovers and imports your modules. By that time it's too late to change the default encoding.

By the same token, it doesn't work if someone runs your code as a module, as their initialisation comes first.

And yes, mixing str and unicode and relying on implicit conversion only pushes the problem further down the line.
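
For reference, the hack in question (Python 2) - site.py removes sys.setdefaultencoding at startup, so code resorts to this at import time:

import sys
reload(sys)                      # brings setdefaultencoding back
sys.setdefaultencoding('utf-8')

Under a test runner, sys and a pile of other modules were already imported and initialised long before these lines run - so, as argued above, it is too late to change the default.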

Dima Tisnek
  • 11,241
  • 4
  • 68
  • 120
  • 1
    unit test module imports main module that sets `sys.defaultencoding('utf-8')`, so why it doesn't work? – anatoly techtonik Feb 25 '15 at 15:13
  • Also, can you provide a real example where `sys.defaultencoding('utf-8')` doesn't work if somebody runs it as a module? – anatoly techtonik Feb 25 '15 at 15:20
  • @techtonik by the time tested module is imported, a bunch of other modules were imported and some other tests may have been ran. In addition, stdio was already initialised with system true default encoding. It's arguable you should not change default encoding on import at all, e.g. pydoc won't work right. Furthermore you should reset system to original state after your tests are done. In summary, if you only test your code and nothing else, and you only use implicit conversion for own data and not e.g. stdio, yes it may just work for you. But only you. – Dima Tisnek Feb 26 '15 at 07:37
  • "stdio was already initialised with system true default encoding" - isn't it always ascii? – anatoly techtonik Feb 26 '15 at 08:45
  • it seems that the real problem in your case is that all your unit tests are sharing the same interpreter. If unit test messes with global state, it should be isolated and run in separate interpreter. But for application scope all unit tests are consistent and use the same `sys.defaultencoding('utf-8')`. Also, note that I UTF-8 is critical for this question and it is backward compatible with ASCII. – anatoly techtonik Feb 26 '15 at 08:50
  • `sys.setdefaultencoding()` doesn't set input or output encoding; I think you misunderstood what the function *does*. It sets the codec used when *implicitly* encoding `unicode` to `str` or decoding `str` to `unicode` when mixing the types. – Martijn Pieters Apr 10 '15 at 12:52
  • Wether it works with unit tests or not is then dependent on the same factors as 3rd party libraries; if the code is *relying on ASCII being the default* then those tests may fail because that default was changed, globally. – Martijn Pieters Apr 10 '15 at 12:53
  • @techtonik re: mixing modules. Other modules are loaded first, they already imported `sys`. When your module runs, it's too late to change the encoding. Available hacks are `sitecustomize.py` and `reload(sys)`. The earlier doesn't work with unit tests and is not composable. The latter is black magic, you're on your own. – Dima Tisnek Apr 13 '15 at 10:07
  • Indeed stdio is initialised based on PYTHONIOENCODING and locale. Thanks, @MartijnPieters. – Dima Tisnek May 05 '15 at 08:31
1

One thing we should know is

Python 2 uses sys.getdefaultencoding() to decode/encode between str and unicode

[diagram: conversion between str and unicode]

So if we change the default encoding, there will be all kinds of incompatibility issues, e.g.:

# coding: utf-8
import sys

print "你好" == u"你好"
# False - under the default ASCII codec the str bytes cannot be decoded

reload(sys)  # site.py removed setdefaultencoding at startup; restore it
sys.setdefaultencoding("utf-8")

print "你好" == u"你好"
# True - the str operand is now implicitly decoded as UTF-8

More examples:

That said, I remember there is some blog post suggesting to use unicode whenever possible, and only byte strings when dealing with I/O. I think if you follow this convention, life will be much easier. More solutions can be found:

Jiacai Liu
  • 2,623
  • 2
  • 22
  • 42
  • Is it possible to overload `==` operator for u-strings so that they always exit with an error when the implicit conversion like this occurs? – anatoly techtonik Jun 30 '16 at 10:05
  • No, you can't. In python there is no way to change the definition of builtin type – Jiacai Liu Jun 30 '16 at 15:27
  • 1
    From what I observe from the above, we 'must' use `sys.setdefaultencoding("utf-8")` all the time in order to make `"你好" == u"你好"` as `True` which is correct – nehem Nov 17 '17 at 02:50
  • 1
    @nehemiah: Exactly!! Just like `3 == 3.0` is also `True`. Equaliity is a statement about the information itself and not about which datatype it is wrapped into. – Red Pill Jan 13 '18 at 13:23
  • 1
    2018 now and I still find it close to *insane*, that the same people who all the years refused to allow python the def.enc utf-8 switch, refused to repair broken behaviour like this, because it woud be **"dangerous"**.... `>>> print "abc" == u"abc" => True` `>>> print "你bc" == u"你bc" => False` ...are the same which, in their unicode sandwich idea, accept a silent `decode('utf-8')` in pretty much ANY I/O lib of Python3. – Red Pill Jan 13 '18 at 13:43
  • @nehemiah Better not. FYI I have updated my answer to provide a solution. – Jiacai Liu Jan 13 '18 at 23:47
  • 1
    @JiacaiLiu: http://utf8everywhere.org/ - the unicode sandwich idea, i.e. unnecessarily decode all text values at I/O (and leave it to the I/O libs to do decode('utf-8') silently, everywhere) is plain broken, compared to using unicode as an api when you need semantic meaning of values for humans, which is rarely the case in computing. Further: In times of microservices everywhere, I/O is everywhere and systems within processing pipelines care about *presence* of text values, not their semantic meaning for humans. Decoding makes no sense and is error prone, in 99%. – Red Pill Jan 14 '18 at 13:12
  • @JiacaiLiu Which one did you mean by solution? I notice the only solution to interface with Unicode in Python 2 is by `sys.setdefaultencoding("utf-8")` – nehem Jan 15 '18 at 23:49
  • @nehemiah https://pythonhosted.org/kitchen/unicode-frustrations.html#a-few-solutions – Jiacai Liu Jan 16 '18 at 02:37
  • @RedPill I agree with you, maybe we can use some libraries to help us deal with this. https://pythonhosted.org/kitchen/unicode-frustrations.html#example-putting-this-all-together-with-kitchen – Jiacai Liu Jan 16 '18 at 02:39
  • @JiacaiLiu kitchen is a well crafted library. Still, many of the "frustrations" addressed in your link are simply not present with the defaultencoding to utf-8 switch. The world has agreed on UTF-8 as omnipresent text data encoding meanwhile - and that is the reason why Python3 works at all: Check any I/O lib (redis, httpie, ...) and you'll see the .decode('utf-8') everywhere in order to pass values into their "unicode sandwhich". With Py2 & dflt.encoding utf8 this all is not necessary, ideal world. One can use unicode as API where needed and proper conversion is done by the language. – Red Pill Jan 16 '18 at 13:40