Python3 - ascii/utf-8/iso-8859-1 can't decode byte 0xe5 (Swedish characters)

Question

I've tried `io`, `repr()`, etc., and none of them work!

Problem inputting `å` (`\xe5`):

(None of these work)

import sys, io
print(sys.stdin.read(1))

sys.stdin = io.TextIOWrapper(sys.stdin.detach(), errors='replace', encoding='iso-8859-1', newline='\n')
print(sys.stdin.read(1))

x = sys.stdin.buffer.read(1)
print(x.decode('utf-8'))

They all give me roughly the same error: `UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe5 in position 0: unexpected end of data`

I also tried running `export PYTHONIOENCODING=utf-8` before starting Python; that doesn't work either.

Now, here's where I'm at:

import sys, codecs
sys.stdout = codecs.getwriter("utf-8")(sys.stdout.detach())
sys.stdin = codecs.getwriter("utf-8")(sys.stdin.detach())

x = sys.stdin.read(1)

print(x.decode('utf-8', 'replace'))

This gives me: ï¿½
It's close...

How can I take a `\xe5` and turn it into `å` in my console, without breaking `input()` as well? The solution above breaks it.

Note: I know this has been asked before, but none of those answers solve it, especially not the `io` ones.

Some info about my system

os.environ['LANG'] == 'C'
sys.getdefaultencoding() == 'utf-8'
sys.stdout.encoding == 'ANSI_X3.4-1968'
sys.stdin.encoding == 'ANSI_X3.4-1968'

My OS: Arch Linux running xterm
Running `locale -a` gives me: C | POSIX | sv_SE.utf8

I've followed these:

(and a few 50 more)

Solution (sort of, still breaks `input()`)

sys.stdout = codecs.getwriter("latin-1")(sys.stdout.detach())
sys.stdin = codecs.getwriter("latin-1")(sys.stdin.detach())

x = sys.stdin.read(1)

print(x.decode('latin-1', 'replace'))

You are not entering UTF-8 data; that looks like Latin-1 instead. — Martijn Pieters, Aug 15 '13 at 20:09
What does `print(sys.stdin.encoding)` tell you *python* thinks your terminal codec is? — Martijn Pieters, Aug 15 '13 at 20:10
It is, I think (iso-8859-1), but even "Latin-1" gives me trouble. So how do I solve it? I've been busting my chops about this literally all day. (Check my system info at the bottom; it's `ANSI_X3`.) — Torxed, Aug 15 '13 at 20:11
`ANSI_X3.4-1968` is ASCII. Basically, see http://en.wikipedia.org/wiki/ASCII#Aliases. Which is rather archaic. What platform is this? — Martijn Pieters, Aug 15 '13 at 20:11
You need to configure your terminal input locale: https://wiki.archlinux.org/index.php/Configuring_locales. Once configured, Python input Just Works, especially when using UTF8. — Martijn Pieters, Aug 15 '13 at 20:12
@MartijnPieters I have; I'm running `sv_SE.utf8` according to xterm. — Torxed, Aug 15 '13 at 20:13
@MartijnPieters Yeah, that too :S It says specifically `C | POSIX | sv_SE.utf8`, and "C" is ambiguous to me. — Torxed, Aug 15 '13 at 20:14
What does sys.stdin.buffer.readline() return? Those are the bytes that get decoded using sys.stdin.encoding. You just need to find out the proper encoding and make sys.stdin.encoding that one. — user87690, Aug 15 '13 at 20:15
@MartijnPieters But there _has_ to be a way in Python to re-encode this somehow, via `os.system('export...')`? Because I don't know what my clients will be using on their consoles. — Torxed, Aug 15 '13 at 20:15
@Torxed: Your `xterm` takes keyboard input and provides Python with encoded bytes based on the current locale. *That has to be right* before Python can do much with the input. — Martijn Pieters, Aug 15 '13 at 20:16
@MartijnPieters Explain to me, how this works just perfectly well in Python2.X then? Because if it can do it, there has to be a way in Python3 as well right? — Torxed, Aug 15 '13 at 20:17
@Torxed: this hasn't really changed, other than that in Python 2 *no* decoding was done, while in Python 3 you *are* decoding on input. — Martijn Pieters, Aug 15 '13 at 20:18
@Torxed: You can read from `sys.stdin.buffer` and not have it decode. And `\xe5` is `å` encoded as Latin-1, **not** UTF-8. — Martijn Pieters, Aug 15 '13 at 20:18
@Torxed: Ah, [xterm does not support UTF-8](https://wiki.archlinux.org/index.php/Locale#Xterm_doesn.27t_support_UTF-8). Use a *different* terminal, or configure your locale to use Latin-1 instead, or run it as `uxterm` or `xterm -u8`. — Martijn Pieters, Aug 15 '13 at 20:19
@MartijnPieters Can I write to `sys.stdout.buffer` as well, then? `sys.stdin.buffer` works like a charm, except when I try to print stuff out again, I guess. And again, in Python 2 `print(sys.stdin.read(1))` just worked, nothing else to it. So how come that can write `å` to the console when Python 3 can't, and I would have to switch terminals? It sounds counterproductive if older versions of Python can do the job but 3.x can't. Thanks, by the way, for steering me towards Latin-1; I still don't know how to fix it though :P — Torxed, Aug 15 '13 at 20:22
@Torxed: because you are writing raw bytes back to a terminal that is configured *with the same locale*. If you are reading Latin-1 bytes and write out Latin-1 bytes again, the terminal is happy enough. — Martijn Pieters, Aug 15 '13 at 20:23
@Torxed: **however** if you are reading UTF-8 and then try to count the number of characters in your input, you'll get strange counts as the number of bytes is **not** the same thing as the number of characters. — Martijn Pieters, Aug 15 '13 at 20:24
@Torxed: Then set the locale to `ISO-8859-1` instead, don't mess about so much in Python. — Martijn Pieters, Aug 15 '13 at 20:29
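
The byte-versus-character distinction made in the comments above can be checked directly. This is only a minimal sketch; the variable names are made up for illustration:

```python
# 'å' is a single character, but its length in bytes depends on the encoding.
char = "\u00e5"  # å

latin1_bytes = char.encode("latin-1")  # b'\xe5'
utf8_bytes = char.encode("utf-8")      # b'\xc3\xa5'

print(len(char), len(latin1_bytes), len(utf8_bytes))  # 1 1 2

# A lone 0xe5 byte is a complete Latin-1 character...
print(b"\xe5".decode("latin-1") == char)  # True

# ...but in UTF-8 it only starts a three-byte sequence, so decoding fails.
try:
    b"\xe5".decode("utf-8")
except UnicodeDecodeError as exc:
    decode_error = exc.reason
    print(decode_error)  # unexpected end of data
```

The failure reason is the same "unexpected end of data" text that appears in the error message quoted in the question.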

score 1 · Answer 1 · answered Aug 15 '13 at 20:24

You are running this in xterm, which does not support UTF-8 by default. Run it as `xterm -u8` or use `uxterm` to fix that.

The other way to work around that is to use a different locale; set your locale to Latin-1, for example:

export LANG=sv_SE.ISO-8859-1

but then you are limited to 256 codepoints, versus the full range (over a million) of the Unicode standard.

Note that Python 2 never decoded the input; writing out what you read from the terminal will look fine because the raw bytes you read are interpreted by the terminal in the same locale; reading and writing Latin-1 bytes works just fine. That's not quite the same as processing Unicode data, however.
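
To illustrate the "raw bytes in, raw bytes out" behaviour described above, here is a rough sketch. `echo_raw` is a hypothetical helper, and in-memory `io.BytesIO` objects stand in for `sys.stdin.buffer` and `sys.stdout.buffer` so that it runs without a terminal:

```python
import io

def echo_raw(src, dst, count=1):
    """Copy `count` raw bytes from src to dst without decoding them."""
    data = src.read(count)
    dst.write(data)
    dst.flush()
    return data

# Stand-ins for sys.stdin.buffer / sys.stdout.buffer (assumption: the
# terminal sends Latin-1, so 'å' arrives as the single byte 0xe5).
fake_stdin = io.BytesIO(b"\xe5")
fake_stdout = io.BytesIO()

echoed = echo_raw(fake_stdin, fake_stdout)
print(echoed)  # b'\xe5'

# In a real script this would be: echo_raw(sys.stdin.buffer, sys.stdout.buffer)
```

Nothing is decoded here, so the terminal receives exactly the bytes it sent. It also means Python never sees text, only bytes, so `len()` and slicing work on bytes, not characters.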

I'm getting close, and that export might do it. But argh, I still get `sys.stdout.write(c)` -> `'ascii' codec can't encode character '\xe5' in position 0: ordinal not in range(128)` — Torxed, Aug 15 '13 at 20:29

Torxed · Accepted Answer · 2020-07-08T09:15:11.783

0

I went with a programmatic approach in Python 3 instead of changing the terminal's codec:

import sys, codecs
# Encode everything written to stdout as Latin-1.
sys.stdout = codecs.getwriter("latin-1")(sys.stdout.detach())
# Decode stdin as Latin-1; a reader (not a writer) makes read() return str.
sys.stdin = codecs.getreader("latin-1")(sys.stdin.detach())
sys.stdout.write(sys.stdin.read(1))

Not only does this let you pick an encoding that matches your terminal's, it also requires no outside configuration (such as `export LANG=sv_SE.ISO-8859-1`).

The only downside:

input('something: ')

will break. A fix for that is:

# Shadowing a builtin is bad practice, so reuse a name we're
# used to from Python 2 that no longer exists in Python 3.
def raw_input(txt):
    sys.stdout.write(txt)
    sys.stdout.flush()
    sys.stdin.flush()
    return sys.stdin.readline().strip()
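
A possible alternative to the `codecs` wrappers is to rewrap the underlying binary buffer in a fresh `io.TextIOWrapper`. The sketch below is not part of the original solution; `rewrap` is a hypothetical helper, and a simulated stream is used so it runs without a terminal:

```python
import io

def rewrap(stream, encoding, errors="replace"):
    """Return a new text stream over the same binary buffer, using `encoding`."""
    return io.TextIOWrapper(stream.detach(), encoding=encoding, errors=errors)

# Simulate a stdin that Python opened as ASCII but that carries Latin-1 bytes.
ascii_stdin = io.TextIOWrapper(io.BytesIO(b"\xe5\n"), encoding="ascii")
latin1_stdin = rewrap(ascii_stdin, "latin-1")

line = latin1_stdin.readline().strip()
print(line == "\u00e5")  # True

# In a real script (assumption: the terminal really sends Latin-1):
#   sys.stdin = rewrap(sys.stdin, "latin-1")
#   sys.stdout = rewrap(sys.stdout, "latin-1")
```

Because the replacement is still an ordinary text stream, `input()` should in principle keep working, though that is worth verifying on the target setup. On Python 3.7 and later, `sys.stdin.reconfigure(encoding="latin-1")` changes the encoding without replacing the object, provided nothing has been read from the stream yet.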

A big thanks to Martijn for explaining why, and that the data is in fact Latin-1!

edited Jul 08 '20 at 09:15

answered Aug 15 '13 at 20:42

Torxed


But as you said, "I don't know what my clients will be using on their consoles". If they don't use 'latin-1', this won't work either. You are forcing Python to assume the terminal is 'latin-1'. Set the environment variable correctly for the terminal being used. That's what your clients would have to do as well. You can also set the environment variable `PYTHONIOENCODING` to force Python to use a specific encoding as well, but it better match the terminal. – Mark Tolonen Aug 17 '13 at 20:35
@MarkTolonen No, but now I understand how to get the console's encoding and use the *writer* to match it at the top of my main script. It makes my job a shit ton easier than before; at least I'm close to `Python2.7`'s ease of use by rewriting stdout to match the console. – Torxed Aug 18 '13 at 09:35
"...match that at the top of my main script." Do you mean the `#coding` line? That has nothing to do with the console if that is what you are referring to. That indicates the source file encoding only. – Mark Tolonen Aug 18 '13 at 13:43
"at the top of my main script" I place `sys.stdout = ...` and `sys.stdin = ...`, which enables me to use `sys.stdout.write` as normal, without any encoding issues. This two-liner fixes the encoding issue once and for all. It's easy, sleek and simple. And of course I'm not talking about `#coding`; if I were, I would have written so :) I'm talking about the two-line solution for this whole shebang that caused me headaches for days while porting to `Python3.X` :) – Torxed Aug 18 '13 at 15:41
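
The `PYTHONIOENCODING` route mentioned in this thread can be observed without touching the current process by starting a child interpreter with the variable set. A minimal sketch, where `child_stdout_encoding` is a hypothetical helper name:

```python
import codecs
import os
import subprocess
import sys

def child_stdout_encoding(io_encoding):
    """Start a child interpreter with PYTHONIOENCODING set and
    return the stdout encoding it reports."""
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()

reported = child_stdout_encoding("latin-1")
# Compare via codecs.lookup so aliases such as 'latin-1' and 'iso8859-1' match.
print(codecs.lookup(reported).name == codecs.lookup("latin-1").name)  # True
```

As the comments point out, this only tells Python what to assume. The chosen encoding still has to match what the terminal actually sends and displays.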
