26

I'm using Python 3 (recently switched from Python 2). My code usually runs on Linux but occasionally on Windows. According to the Python 3 documentation for open(), the default encoding for a text file comes from locale.getpreferredencoding() if the encoding argument is not supplied. I want this default to be utf-8 for a project of mine, no matter what OS it's running on (currently it's always UTF-8 on Linux, but not on Windows). The project has a great many calls to open() and I don't want to add encoding='utf-8' to all of them. Thus, I want to change the locale's preferred encoding in Windows, as Python 3 sees it.

I found a previous question, "Changing the 'locale preferred encoding'", which has an accepted answer, so I thought I was good to go. Unfortunately, neither of the commands suggested in that answer and its first comment works for me in Windows. Specifically, they suggest running chcp 65001 and set PYTHONIOENCODING=UTF-8, and I've tried both. Please see the transcript below from my cmd window:

> py -i
Python 3.4.3 ...
>>> f = open('foo.txt', 'w')
>>> f.encoding
'cp1252'
>>> exit()

> chcp 65001
Active code page: 65001

> py -i
Python 3.4.3 ...
>>> f = open('foo.txt', 'w')
>>> f.encoding
'cp1252'
>>> exit()

> set PYTHONIOENCODING=UTF-8

> py -i
Python 3.4.3 ...
>>> f = open('foo.txt', 'w')
>>> f.encoding
'cp1252'
>>> exit()

Note that even after both suggested commands, my opened file's encoding is still cp1252 instead of the intended utf-8.

walrus
  • Maybe it is just my style but I'd prefer to write a wrapper open() function in which you specify the encoding. – x squared Jul 17 '15 at 07:19
  • 2
    Don't use `chcp 65001`. The Windows console does not properly support UTF-8, and it's not doing what you want anyway. `locale.getpreferredencoding` has nothing to do with the console codepage; it's based on the Windows locale's ANSI encoding. For example, if you call Win32 `CreateFileA` (ANSI) instead of `CreateFileW` (UTF-16), the file path string gets decoded as an ANSI string (e.g. Windows-1252). Windows does not allow UTF-8 to be used as the ANSI character set, and the C runtime also doesn't allow using UTF-8 for a locale. – Eryk Sun Jul 17 '15 at 13:43
  • 3
    @eryksun Thanks for the info, but it has too much Windows-specific jargon for me. I rarely use Windows. All I want is a way to say to either Windows 8 or to Python 3: "Dear Windows 8 / Python 3, Please be informed that all the text files on this computer should be encoded in UTF-8 without exception. Please remember this fact in the future when opening text files. Thanks." – walrus Jul 18 '15 at 01:08
  • 1
    @walrus, no such thing exists. The native string format on Windows is UTF-16, using 16-bit `wchar_t` strings. The Windows API only supports 8-bit encodings for the legacy ANSI API, which unfortunately does not allow UTF-8. Python's preferred encoding is simply calling [`GetACP`](https://msdn.microsoft.com/en-us/library/dd318070) to get the ANSI codepage. I sympathize with you and wish that `io.TextIOWrapper` defaulted to UTF-8 on all platforms (your assumption about Linux isn't always valid, either). As things stand you need a wrapper function, as previously suggested. – Eryk Sun Jul 18 '15 at 01:40
  • @eryksun Your Windows details are over my head, as before. But you seem confident that there's no way to do what I want, either in Windows 8 or in Python 3. (I wouldn't have necessarily expected that it would be possible, except the previous thread I linked to gave me lots of false hope!) If you want to make an "answer" stating that this is impossible in Windows 8 and in Python 3 (except for hacks, of course), then I will accept that answer. – walrus Jul 18 '15 at 05:35
  • 1
    A bit of effort gets you to the [`TextIOWrapper`](https://hg.python.org/cpython/file/b4cbecbc0781/Modules/_io/textio.c#l893) source and therein to see that [`_Py_device_encoding`](https://hg.python.org/cpython/file/b4cbecbc0781/Python/fileutils.c#l36) is what uses the Windows console codepage (`GetConsoleCP`), but only for stdin, stdout, and stderr. Otherwise it calls [`getpreferredencoding`](https://hg.python.org/cpython/file/b4cbecbc0781/Lib/_bootlocale.py#l10), which calls [`_getdefaultlocale`](https://hg.python.org/cpython/file/b4cbecbc0781/Modules/_localemodule.c#l283) and thus `GetACP`. – Eryk Sun Jul 18 '15 at 07:11
  • @xsquared The problem with directly specifying the encoding to utf-8 is that then it ignores the BOM of other encodings even when it is present. – Jimmy He Jan 30 '23 at 22:10
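
The wrapper function suggested in the comments above could look roughly like the following sketch. The name `open_utf8` is hypothetical (nothing in this thread names it); it forwards everything to the built-in open() and only fills in UTF-8 when the file is opened in text mode and the caller passes no encoding.

```python
import os
import tempfile


def open_utf8(file, mode="r", buffering=-1, encoding=None, **kwargs):
    """Hypothetical drop-in for open() whose text-mode default is UTF-8."""
    # Binary mode accepts no encoding, so only fill in the default for text mode.
    if encoding is None and "b" not in mode:
        encoding = "utf-8"
    return open(file, mode, buffering, encoding, **kwargs)


# Small demonstration on a throwaway file.
demo_path = os.path.join(tempfile.mkdtemp(), "foo.txt")
with open_utf8(demo_path, "w") as f:
    print(f.encoding)  # utf-8, regardless of the platform's locale
```

This still requires replacing each open() call with the wrapper, but that is a mechanical search-and-replace, and an explicit encoding passed by the caller keeps working.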

5 Answers

16

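As of Python 3.5.1, the hack looks like this:

import _locale
# Replace the private helper that reports (language, encoding) so that the encoding is always utf8.
_locale._getdefaultlocale = (lambda *args: ['en_US', 'utf8'])

All files opened after this runs will default to utf8 when no encoding is passed.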

axil
  • 1
    Or better yet, `utf_8_sig` as it will take care of the BOM character that some Windows editors tend to inject into the files even for such an endian-neutral encoding as `utf8`. – axil Dec 17 '15 at 22:07
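
To illustrate the difference the comment above refers to, here is a minimal sketch that reads the same BOM-prefixed file with both codecs. The file name and contents are made up for the demonstration.

```python
import os
import tempfile

# Write a file the way some Windows editors do: UTF-8 with a leading BOM.
path = os.path.join(tempfile.mkdtemp(), "bom.txt")
with open(path, "wb") as f:
    f.write(b"\xef\xbb\xbfhello")

with open(path, encoding="utf-8") as f:
    plain = f.read()      # the BOM survives as a leading U+FEFF character

with open(path, encoding="utf-8-sig") as f:
    stripped = f.read()   # the BOM is consumed by the codec

print(repr(plain))     # '\ufeffhello'
print(repr(stripped))  # 'hello'
```

One thing to keep in mind: when used for writing, utf-8-sig also emits a BOM at the start of the file, which may or may not be wanted if it becomes the default.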
12

I know it's a really hacky workaround, but you could redefine the locale.getpreferredencoding() function like so:

import locale
def getpreferredencoding(do_setlocale=True):
    # Ignore the real locale and always report UTF-8.
    return "utf-8"
locale.getpreferredencoding = getpreferredencoding

If you run this early on, all files opened afterwards (at least in my testing on a Windows XP machine) open in utf-8, and since this overrides the module function, it applies on all platforms.

James Kent
  • 1
    I tested it on Python 3.5.1 and Windows 7; [have a look](http://stackoverflow.com/a/34345136/4933641) at what I ended up with. – axil Dec 17 '15 at 22:09
  • 1
    This does not seem to work, at least on my Windows 10 and Python 3.6.8. – sandwood Mar 18 '20 at 13:31
  • 1
    @sandwood Have a look at axil's answer above for one that works after Python 3.5. – James Kent Mar 18 '20 at 15:26
  • Thanks for your help. Yes, @axil's answer works in 3.6.8. Curiously, according to the Python docs for 3.6, your answer should work. – sandwood Mar 18 '20 at 16:29
  • This hack worked in a Google Colab Notebook that was giving an error: ``` /usr/local/lib/python3.8/dist-packages/google/colab/_system_commands.py in _run_command(cmd, clear_streamed_output) 161 locale_encoding = locale.getpreferredencoding() 162 if locale_encoding != _ENCODING: --> 163 raise NotImplementedError( 164 'A UTF-8 locale is required. Got {}'.format(locale_encoding)) 165 NotImplementedError: A UTF-8 locale is required. Got ANSI_X3.4-1968 ``` – taylor Feb 10 '23 at 22:44
5

The locale can be set globally to UTF-8 in Windows, if you so desire, as follows:

Control panel -> Clock and Region -> Region -> Administrative -> Change system locale -> Check Beta: Use Unicode UTF-8 ...

After this, and a reboot, I confirmed that locale.getpreferredencoding() returns 'cp65001' (=UTF-8) and that functions like open default to UTF-8.
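
A check like the one described above can be scripted. This is only a sketch; the helper name `default_open_encoding` is made up for illustration. It reports what locale.getpreferredencoding() returns and which encoding open() actually picks when none is given.

```python
import locale
import os
import tempfile


def default_open_encoding():
    """Return the encoding open() chooses when no encoding argument is given."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        with open(path, "w") as f:  # deliberately no encoding argument
            return f.encoding
    finally:
        os.remove(path)


print("preferred:", locale.getpreferredencoding(False))
print("open():   ", default_open_encoding())
```

Running it before and after changing the setting (and rebooting) shows whether the change reached Python.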

JBSnorro
2

The post is old, but the issue is still relevant (under Python 3.7 and Windows 10).

I've improved the solution as follows, making sure that the language/country part isn't overwritten but only the encoding, and also to make sure that it is only done under Windows:

import os

if os.name == "nt":
    import _locale
    # Keep the original helper so the language/country part is preserved.
    _locale._gdl_bak = _locale._getdefaultlocale
    _locale._getdefaultlocale = (lambda *args: (_locale._gdl_bak()[0], 'utf8'))

Hope this helps...

Eric L.
1

As of Python 3.7, you may want to use UTF-8 mode, enabled by setting the PYTHONUTF8=1 environment variable or by passing the -X utf8 flag to Python. Note that it switches a few more things to UTF-8 than just locale.getpreferredencoding, but that may well be a good thing. As of Python 3.15, UTF-8 mode is set to become the default.
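
As a rough illustration, the sketch below starts a child interpreter with the `-X utf8` flag and asks it for its preferred encoding. The helper name `preferred_encoding_in_child` is hypothetical, and the exact spelling of the result ('UTF-8' or 'utf-8') varies between Python versions, so no particular output is shown.

```python
import subprocess
import sys


def preferred_encoding_in_child(*interpreter_flags):
    """Run a child Python with the given flags and return its preferred encoding."""
    code = "import locale; print(locale.getpreferredencoding(False))"
    result = subprocess.run(
        [sys.executable, *interpreter_flags, "-c", code],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout.strip()


# sys.flags.utf8_mode (Python 3.7+) tells whether this process is in UTF-8 mode.
print("this process, utf8_mode =", sys.flags.utf8_mode)
print("child without flags:", preferred_encoding_in_child())
print("child with -X utf8: ", preferred_encoding_in_child("-X", "utf8"))
```

Setting PYTHONUTF8=1 in the environment before launching Python should have the same effect as the flag, without touching the command line.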

JanKanis