
I have a subprocess command that outputs some characters such as '\xf1'. I'm trying to decode it as utf8 but I get an error.

s = '\xf1'
s.decode('utf-8')

The above throws:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xf1 in position 0: unexpected end of data

It works when I use 'latin-1' but shouldn't utf8 work as well? My understanding is that latin1 is a subset of utf8.

Am I missing something here?

EDIT:

print s # ñ
repr(s) # returns "'\\xf1'"
tchrist
trinth
  • `My understanding is that latin1 is a subset of utf8` no, it's not: ASCII is a subset of utf8. – mouad Aug 23 '11 at 15:26
  • but shouldn't utf8 have a character at that code point though? – trinth Aug 23 '11 at 15:32
  • 1
    It's the first byte of a multi-byte sequence in utf-8, so it's not valid by itself. – agf Aug 23 '11 at 15:34
  • 1
    @trinth: Many unicode characters encoded in utf-8 are two or more bytes long. `\xf1` might be part of such a character, even though `\xf1` by itself does not decode. Can you post a longer snippet of the `repr` of the output and (if possible) what it is suppose to represent? – unutbu Aug 23 '11 at 15:39
  • I added the output above. The output is of type string and can contain many other characters. The code point above is just an example. – trinth Aug 23 '11 at 15:49
  • **As soon as you have to worry about encodings and decodings on a case-by-case basis, you have *almost certainly* gone down the wrong track.** Just set everything up once and for all, and then leave the encoding jazz alone. You should not have to be doing explicit encoding or decoding. That only happens with things like databases that require raw byte strings because you have no translation layer established. – tchrist Aug 23 '11 at 16:29
  • @tchrist: subprocess output always shows up as byte strings. So yes, you do need to decode it. – Thomas K Aug 23 '11 at 16:33
  • @Thomas: That’s a [known bug](http://bugs.python.org/issue6135) with subprocesses in Python. You *should* be able to attach an encoding/decoding to any stream, the way you can with `OutputStreamWriter` and `InputStreamReader` in Java or with `binmode` in Perl. Otherwise you cannot use all streams transparently if you cannot promote byte streams to character streams. There should not be 1st-class streams vs 2nd-class streams; there should only be streams. Nobody should have to do manual encoding/decoding; this should be a stream property. The current situation is unacceptably st∞pid. – tchrist Aug 23 '11 at 16:48
  • @tchrist: Well, you can wrap it in a `codecs.StreamReader` subclass to do the decoding, but that's just a different style of doing the same thing. Conceptually, pipes are bytestreams, with no encoding attached. They're not unicode, although they might represent unicode. – Thomas K Aug 23 '11 at 16:59
  • @Thomas: It’s all super awkward. But for my part I’ll take `python3.2 -c 'howdy = "hello nin\u0301o\n"; from os import popen; kid_both = popen("uniquote -v | cat -n", "w"); kid_both.write(howdy)'` over `python3.2 -c 'howdy = "hello nin\u0301o\n"; from subprocess import Popen, PIPE; kid_head = Popen(["uniquote", "-v"], stdin=PIPE); kid_tail = Popen(["cat", "-n"], stdin=kid_head.stdout); kid_head.communicate(howdy.encode("UTF-8"))'` any day of the week to print `"hello nin\N{COMBINING ACUTE ACCENT}o"`. Being unable to send a simple string—even plain ASCII—without repeated `encode`s **grates**. – tchrist Aug 23 '11 at 17:44
  • @tchrist: Accepting pure-ascii unicode for bytes is what Python 2 did, and that's a mess, because English speaking programmers like me always forget to test with non-ascii characters, leaving code that will crash hard when it meets an `é`. Assuming UTF-8 might be OK (it looks like subprocess does that if you specify `universal_newlines=True`). – Thomas K Aug 23 '11 at 20:04
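As agf and unutbu point out in the comments above, 0xF1 announces a four-byte UTF-8 sequence. A quick Python 3 sketch confirms it (the three continuation bytes here are arbitrary values of my own choosing, picked only to complete the sequence):

```python
# A lone 0xF1 byte is an incomplete UTF-8 sequence.
lone = b'\xf1'
try:
    lone.decode('utf-8')
    decodes_alone = True
except UnicodeDecodeError as exc:
    decodes_alone = False
    reason = exc.reason  # 'unexpected end of data'

# 0xF1 = 0b11110001 is a four-byte lead byte; supplying three
# continuation bytes (10xxxxxx) yields exactly one code point.
full = b'\xf1\x80\x80\x80'
decoded = full.decode('utf-8')  # a single code point, U+40000
```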

5 Answers


You have confused Unicode with UTF-8. Latin-1 is a subset of Unicode, but it is not a subset of UTF-8. Avoid like the plague ever thinking about individual code units. Just use code points. Do not think about UTF-8. Think about Unicode instead. This is where you are being confused.

Source Code for Demo Program

Using Unicode in Python is very easy. It’s especially easy with Python 3 and wide builds, the only way I use Python, but you can still use the legacy Python 2 under a narrow build if you are careful about sticking to UTF-8.

To do this, always set your source code encoding and your output encoding correctly to UTF-8. Now stop thinking of UTF-anything and use only UTF-8 literals, logical code point numbers, or symbolic character names throughout your Python program.

Here’s the source code with line numbers:

% cat -n /tmp/py
     1  #!/usr/bin/env python3.2
     2  # -*- coding: UTF-8 -*-
     3  
     4  from __future__ import unicode_literals
     5  from __future__ import print_function
     6  
     7  import sys
     8  import os
     9  import re
    10  
    11  if not (("PYTHONIOENCODING" in os.environ)
    12              and
    13          re.search("^utf-?8$", os.environ["PYTHONIOENCODING"], re.I)):
    14      sys.stderr.write(sys.argv[0] + ": Please set your PYTHONIOENCODING envariable to utf8\n")
    15      sys.exit(1)
    16  
    17  print('1a: el ni\xF1o')
    18  print('2a: el nin\u0303o')
    19  
    20  print('1b: el niño')
    21  print('2b: el niño')
    22  
    23  print('1c: el ni\N{LATIN SMALL LETTER N WITH TILDE}o')
    24  print('2c: el nin\N{COMBINING TILDE}o')

And here are those print calls with their non-ASCII characters uniquoted using the \x{⋯} notation:

% grep -n ^print /tmp/py | uniquote -x
17:print('1a: el ni\xF1o')
18:print('2a: el nin\u0303o')
20:print('1b: el ni\x{F1}o')
21:print('2b: el nin\x{303}o')
23:print('1c: el ni\N{LATIN SMALL LETTER N WITH TILDE}o')
24:print('2c: el nin\N{COMBINING TILDE}o')

Sample Runs of Demo Program

Here’s a sample run of that program that shows the three different ways (a, b, and c) of doing it: the first set as literals in your source code (which will be subject to StackOverflow’s NFC conversions and so cannot be trusted!!!) and the second two sets with numeric Unicode code points and with symbolic Unicode character names respectively, again uniquoted so you can see what things really are:

% python /tmp/py
1a: el niño
2a: el niño
1b: el niño
2b: el niño
1c: el niño
2c: el niño

% python /tmp/py | uniquote -x
1a: el ni\x{F1}o
2a: el nin\x{303}o
1b: el ni\x{F1}o
2b: el nin\x{303}o
1c: el ni\x{F1}o
2c: el nin\x{303}o

% python /tmp/py | uniquote -v
1a: el ni\N{LATIN SMALL LETTER N WITH TILDE}o
2a: el nin\N{COMBINING TILDE}o
1b: el ni\N{LATIN SMALL LETTER N WITH TILDE}o
2b: el nin\N{COMBINING TILDE}o
1c: el ni\N{LATIN SMALL LETTER N WITH TILDE}o
2c: el nin\N{COMBINING TILDE}o

I really dislike looking at binary, but here is what that looks like as binary bytes:

% python /tmp/py | uniquote -b
1a: el ni\xC3\xB1o
2a: el nin\xCC\x83o
1b: el ni\xC3\xB1o
2b: el nin\xCC\x83o
1c: el ni\xC3\xB1o
2c: el nin\xCC\x83o

The Moral of the Story

Even when you use UTF-8 source, you should think and use only logical Unicode code point numbers (or symbolic named characters), not the individual 8-bit code units that underlie the serial representation of UTF-8 (or for that matter of UTF-16). It is extremely rare to need code units instead of code points, and it just confuses you.
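In Python 3 the distinction is directly observable: a str indexes logical code points, while its encoded bytes object exposes the underlying UTF-8 code units. A small sketch of my own, using the same el niño string as the demo:

```python
nfc = 'el ni\xF1o'            # NFC form: U+00F1 is a single code point
assert len(nfc) == 7          # seven logical code points...
assert len(nfc.encode('utf-8')) == 8   # ...but eight UTF-8 code units

nfd = 'el nin\u0303o'         # NFD form: 'n' plus U+0303 COMBINING TILDE
assert len(nfd) == 8          # one more code point than the NFC form
assert len(nfd.encode('utf-8')) == 9   # and one more byte as well
```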

You will also get more reliable behavior if you use a wide build of Python 3 than you will with the alternatives to those choices, but that is a UTF-32 matter, not a UTF-8 one. Both UTF-32 and UTF-8 are easy to work with, if you just go with the flow.

tchrist

Latin-1 is not a subset of UTF-8. UTF-8 encodes ASCII with the same single bytes; for all other code points, it uses multiple bytes.

Put simply, \xf1 is not valid UTF-8, as Python tells you. "Unexpected end of data" indicates that this byte begins a multi-byte sequence whose continuation bytes were never provided.

I recommend you read up on UTF-8.
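When a stream may contain bytes that are not valid UTF-8, Python's decode also accepts an errors policy instead of raising. A sketch (the byte string is a made-up example):

```python
bad = b'hello \xf1 world'  # 0xF1 starts a sequence that never completes

# errors='strict' (the default) raises UnicodeDecodeError; alternatives:
replaced = bad.decode('utf-8', errors='replace')  # substitutes U+FFFD
ignored = bad.decode('utf-8', errors='ignore')    # drops the bad byte
```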

David Heffernan

My understanding is that latin1 is a subset of utf8.

Wrong. Latin-1, aka ISO 8859-1 (and sometimes erroneously called Windows-1252), is not a subset of UTF-8. ASCII, on the other hand, is a subset of UTF-8. ASCII strings are valid UTF-8 strings, but generalized Windows-1252 or ISO 8859-1 strings are not valid UTF-8, which is why s.decode('UTF-8') is throwing a UnicodeDecodeError.
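The asymmetry is easy to check in Python 3 (my own quick sketch): any ASCII byte string decodes identically under both codecs, while a Latin-1 byte above 0x7F is rejected by the UTF-8 codec:

```python
ascii_bytes = 'hello'.encode('ascii')
# ASCII is the common subset: both decodes agree.
same = ascii_bytes.decode('utf-8') == ascii_bytes.decode('latin-1')

latin = '\xf1'.encode('latin-1')   # b'\xf1', the Latin-1 byte for ñ
try:
    latin.decode('utf-8')
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False             # not valid UTF-8 on its own
```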

Adam Rosenfield

It's the first byte of a multi-byte sequence in UTF-8, so it's not valid by itself.

In fact, it's the first byte of a 4 byte sequence.

Bits  Last code point  Byte 1    Byte 2    Byte 3    Byte 4
21    U+1FFFFF         11110xxx  10xxxxxx  10xxxxxx  10xxxxxx

See here for more info.
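That table can be read straight off the lead byte's high bits. Here is a small classifier of my own (the helper name is made up) to show the arithmetic:

```python
def utf8_seq_len(lead):
    """Sequence length announced by a UTF-8 lead byte,
    or 0 for a continuation byte / invalid lead."""
    if lead >> 7 == 0b0:
        return 1        # 0xxxxxxx: single-byte (ASCII)
    if lead >> 5 == 0b110:
        return 2        # 110xxxxx: two-byte sequence
    if lead >> 4 == 0b1110:
        return 3        # 1110xxxx: three-byte sequence
    if lead >> 3 == 0b11110:
        return 4        # 11110xxx: four-byte sequence
    return 0            # 10xxxxxx continuation, or invalid

length = utf8_seq_len(0xF1)   # 0xF1 = 0b11110001 -> 4
```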

Matt Joiner

The easy way (Python 3):

s = '\xf1'
bytes(s, 'utf-8').decode('utf-8')
# 'ñ'

If you are trying to decode escaped unicode, you can use:

s = 'Autom\\u00e1tico'
bytes(s, "utf-8").decode('unicode-escape')
# 'Automático'
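One caveat about this route (my own observation, not from the answer): unicode-escape interprets non-escape bytes as Latin-1, so it is not a substitute for a genuine UTF-8 decode:

```python
escaped = 'Autom\\u00e1tico'                # literal backslash escape
unescaped = bytes(escaped, 'utf-8').decode('unicode-escape')
# unescaped is now 'Automático'

raw = 'Automático'.encode('utf-8')          # genuine UTF-8 bytes
mojibake = raw.decode('unicode-escape')     # wrong tool: Latin-1 per byte
correct = raw.decode('utf-8')               # right tool for UTF-8 input
```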
Adán Escobar