10

What is the difference between u'' prefix and unicode()?

# -*- coding: utf-8 -*-
print u'上午'  # this works
print unicode('上午', errors='ignore') # this works but prints nothing
print unicode('上午') # error

For the third print, the error shows: UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0

If I have a text file containing non-ascii characters, such as "上午", how to read it and print it out correctly?

Martijn Pieters
DehengYe
  • http://stackoverflow.com/questions/761361/suppress-the-uprefix-indicating-unicode-in-python-strings – Praveen Aug 20 '15 at 07:19
  • The question you've added in your edit is sort of nonsense. What is the encoding of your `"\x97"` byte? Whatever your answer is, use that as the argument to `unicode` (or `str.decode`) rather than `"utf-8"`. As Joel Spolsky wrote in [the post linked by Martijn Pieters](http://joelonsoftware.com/articles/Unicode.html): "**It does not make sense to have a string without knowing what encoding it uses.**" – Blckknght Aug 20 '15 at 08:59

4 Answers

16
  • u'..' is a string literal, and decodes the characters according to the source encoding declaration.

  • unicode() is a function that converts another type to a unicode object; you've given it a byte string literal, which it will decode according to the default ASCII codec.

So you created a byte string object using a different type of literal notation, then tried to convert it to a unicode() object, which fails because the default codec for str -> unicode conversions is ASCII.
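In Python 3 terms the same failure is easy to reproduce with an explicit bytes literal (the byte values below are the UTF-8 encoding of '上午'; this is just an illustrative sketch, not the answer's own code):

```python
data = b'\xe4\xb8\x8a\xe5\x8d\x88'  # UTF-8 bytes for '上午'

try:
    data.decode('ascii')   # the step unicode() attempts by default in Python 2
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True    # byte 0xe4 has no mapping in the ASCII codec

text = data.decode('utf8')  # naming the right codec succeeds
```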

The two are quite different beasts. If you want to use the latter, you need to give it an explicit codec:

print unicode('上午', 'utf8')

The two are related in the same way that using 0xFF and int('0xFF', 0) are related; the former defines an integer of value 255 using hex notation, the latter uses the int() function to extract an integer from a string.
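A minimal sketch of that analogy (runs unchanged in Python 2 and 3):

```python
a = 0xFF             # literal notation, like u'...'
b = int('0xFF', 0)   # conversion from a string, like unicode(..., codec)
# base 0 tells int() to infer the base from the '0x' prefix
```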

An alternative method would be to use the str.decode() method:

print '上午'.decode('utf8')

Don't be tempted to use an error handler (such as 'ignore' or 'replace') unless you know what you are doing. 'ignore' especially can mask underlying issues, such as having picked the wrong codec.
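A sketch of how 'ignore' can silently destroy data when the codec guess is wrong (using GBK, a common codec for Chinese text, purely as an example):

```python
gbk_bytes = u'\u4e0a\u5348'.encode('gbk')   # '上午' encoded with GBK, not UTF-8

# None of these bytes form valid UTF-8 sequences, so 'ignore' drops every one.
masked = gbk_bytes.decode('utf8', 'ignore')

# Decoding with the codec the data was actually written in recovers the text.
recovered = gbk_bytes.decode('gbk')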

You may want to read up on Python and Unicode; the official Python Unicode HOWTO and Joel Spolsky's article on Unicode (linked in the comments above) are good starting points.

Martijn Pieters
  • Thank you. It works. But, I tried reading a text file containing non-ascii characters, such as "上午". For each line of the file, I do unicode(line, 'utf8'), it shows the same error as said in the question description. – DehengYe Aug 20 '15 at 07:41
  • @DehengYe: perhaps your file uses a *different codec*? And don't use `unicode()` to decode every line. You can use `import io`, then `io.open(filename, encoding='utf8')` (or another codec) to have Python decode the file contents for you as you read. – Martijn Pieters Aug 20 '15 at 07:42
  • I encounter new problems. Would you be able to see my new edits? Thank you. @MartijnPieters – DehengYe Aug 20 '15 at 08:36
  • @DehengYe: please don't ask new questions in your current post; use new posts for that. You cannot convert that byte because it is *not UTF-8*. You have a different codec there, you'll have to figure out what data you do have. Don't try to decode it as UTF-8. – Martijn Pieters Aug 20 '15 at 08:59
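The io.open approach from the comments above can be sketched like this (the file name is hypothetical; in Python 3, io.open is the built-in open):

```python
import io
import os
import tempfile

# Stand-in for the question's text file: write '上午' out as UTF-8 bytes.
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with io.open(path, 'w', encoding='utf8') as f:
    f.write(u'\u4e0a\u5348\n')

# io.open decodes each line for you as you read; no unicode() calls needed.
with io.open(path, encoding='utf8') as f:
    lines = [line.strip() for line in f]
```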
2

When a string literal is not prefixed by u'' in Python 2.7.x, what the interpreter sees is a byte string (str), without an explicit encoding.

If you do not tell the interpreter what to do with those bytes when executing unicode(), it will (as you saw) default to trying to decode the bytes it sees via the ascii encoding scheme.

It does so as a preliminary step in trying to turn the plain bytes of the str into a unicode object.

Using ascii to decode means: try to interpret each byte of the str via a hard-coded mapping that only covers byte values 0 through 127.

The error you encountered was like a dict KeyError: the interpreter encountered a byte for which the ascii encoding scheme does not have a specified mapping.

Since the interpreter doesn't know what to do with the byte, it throws an error.

You can change that preliminary step by telling the interpreter to decode the bytes using another set of encoding/decoding mappings instead, one that goes beyond ascii, such as UTF-8, as elaborated in other answers.

If the interpreter finds a mapping in the chosen scheme for each byte (or bytes) in the str, it will decode successfully, and the interpreter will use the resulting mappings to produce a unicode object.

A Python unicode object is a series of Unicode code points. There are 1,112,064 valid code points in the Unicode code space.

And if the scheme you choose for decoding is the one with which your text (or code points) were encoded, then the output when printing should be identical to the original text.
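That round trip can be sketched with explicit escapes (u'\u4e0a\u5348' is '上午'; the u'' prefix also works in Python 3.3+):

```python
original = u'\u4e0a\u5348'          # the text '上午' as two code points
encoded = original.encode('utf8')   # code points -> bytes (six bytes here)
decoded = encoded.decode('utf8')    # same scheme back -> identical text
```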

You can also consider trying Python 3; the relevant difference is explained in the first comment below.

scharfmn
  • Python 3 doesn't default *strings* to utf-8 - rather, a `str` is unicode text (a sequence of unicode codepoints), like the 2.x `unicode` type. The default UTF8 in Python 3 refers to the *source file* encoding if you don't use an [encoding declaration](http://legacy.python.org/dev/peps/pep-0263/) (it was ASCII before Python 3 and iso 8859-1 pre-2.5). – lvc Aug 20 '15 at 08:07
0

unicode is an object type, whereas u'' is literal notation used to denote that the object is a unicode object. It is similar to the L suffix used to denote a long int.

hspandher
  • Except the `L` prefix isn't needed to create a long int if your literal is > `sys.maxint`, and `long(123)` doesn't make a translation step that can be regulated with an extra argument. – Martijn Pieters Aug 20 '15 at 07:29
  • Moreover, for all but indexing operations and certain C calls, long integers and regular integers are all but the exact same thing. – Martijn Pieters Aug 20 '15 at 07:30
0

Please try: '上午'.decode('utf8', 'ignore').encode('utf8')

futurelj
  • Would you like to elaborate a bit? Thank you. – DehengYe Aug 20 '15 at 07:37
  • I tried, this actually works when I read a file containing non-ascii code. So I upvote it. – DehengYe Aug 20 '15 at 07:43
  • Why the additional encode? Sure `'byte str literal'.decode('source codec')` will do the same thing as `unicode('byte str literal', 'source codec')`. – Martijn Pieters Aug 20 '15 at 07:43
  • @DehengYe: that's because the `'ignore'` error handler will cause **any** input to work. It doesn't mean that you got usable text. – Martijn Pieters Aug 20 '15 at 07:44
  • And why *'ignore'*? That just masks any problems, what if the input is not actually UTF-8? – Martijn Pieters Aug 20 '15 at 07:44
  • @MartijnPieters thank you so much for your dedication. – DehengYe Aug 20 '15 at 07:48
  • @DehengYe: Say your input file uses the [GBK codec](https://en.wikipedia.org/wiki/GBK), then using `line.decode('utf8', 'ignore')` will give you a result, but it won't be any *useful* result. `u'上午'.encode('gbk').decode('utf8', 'ignore')` gives you an *empty string* as all the bytes in the encoded string are not valid UTF-8 bytes, so they all are ignored when decoding. But GBK is one of the most common codecs used for Chinese text. – Martijn Pieters Aug 20 '15 at 07:49