1

As far as I know there is a difference between strings and unicode strings in Python. But is it possible to instruct Python to use unicode strings instead of regular ones whenever a string object is created?

So when I get a text input, I don't need to use unicode()?

I might sound lazy but I am just interested if this is possible...

p.s. I don't know a lot about character encoding so please correct me if I got anything wrong

Cosinux
  • 3
    Yes, simply use Python 3. It doesn't have non-unicode strings. – Stefan Pochmann Jul 01 '16 at 00:58
  • But what if I prefer using Python 2? – Cosinux Jul 01 '16 at 01:02
  • 1
    @Cosinux. Have you actually used Python 3? If so, what specific problems did you have with it that made you prefer Python 2? – ekhumoro Jul 01 '16 at 01:33
  • What Stefan said. Unicode requires different handling to simple ASCII strings. If you don't like the way Python 2 does it then you should be using Python 3. For that matter, you should be using Python 3 for _all_ new code. The only reason to use Python 2 these days is if you're forced to work on legacy code, or you need to use some obscure library that hasn't been ported. But you should take a look at [Pragmatic Unicode](http://nedbatchelder.com/text/unipain.html) by SO veteran Ned Batchelder. – PM 2Ring Jul 01 '16 at 01:35
  • @StefanPochmann That's not correct, both Python 2 and 3 have both byte strings and unicode strings, you have `'abc'` and `u'abc'` in Python 2, and `b'abc'` and `'abc'` in Python 3. – roeland Jul 01 '16 at 01:35
  • Here's a bit of reading for a start → https://docs.python.org/2/howto/unicode.html – roeland Jul 01 '16 at 01:37
  • FWIW, you can also use the `u'abc'` syntax in recent versions of Python 3; this makes it a little easier to write code that runs correctly on Python 2 and 3. – PM 2Ring Jul 01 '16 at 01:38
  • I am used to Python 2. Also all my old scripts are written in Python 2. But that is not the point. I just wondered if it was possible... – Cosinux Jul 01 '16 at 01:39
  • @roeland Fair point, however Python 3's `bytes` objects are somewhat different to Python 2 strings. – PM 2Ring Jul 01 '16 at 01:40
  • @roeland No, b'abc' isn't a string, it's a bytes. – Stefan Pochmann Jul 01 '16 at 03:48
  • @StefanPochmann: Python 3 as well as Python 2 has bytestrings. Python 2 as well as Python 3 has Unicode strings. The only difference is that without `from __future__ import unicode_literals` you get a bytestring if you create a string using the `"abc"` **literal** on Python 2, while the same literal creates a Unicode string on Python 3. You can create a bytestring from a literal using `b'abc'` on both Python 2 and 3. You can create a Unicode string from a literal using `u'abc'` on both Python 2 and 3. – jfs Jul 01 '16 at 12:05
  • @J.F.Sebastian Again: b'abc' isn't a string, it's a bytes. – Stefan Pochmann Jul 01 '16 at 17:53
  • @StefanPochmann please, stop spreading misinformation. `b"abc"` is a string (a bytestring) and yes, the type is called `bytes` (a bytestring is an immutable sequence of bytes). – jfs Jul 01 '16 at 18:31
  • @J.F.Sebastian As far as I can tell, you're the one spreading misinformation. Show me where the Python 3 docs call bytes a string. I had already checked and as far as I can tell, they don't. Also, PEP 358, which I think is very relevant and is co-authored by Guido, says for example *"in Python 2.6, we have two string types, str and unicode, while in Python 3.0 we will only have **one** string type, whose name will be str"*. Seems quite clear to me. – Stefan Pochmann Jul 01 '16 at 20:43
  • @StefanPochmann why do you think it is relevant whether or not Python 3 docs call bytes a string? I would understand if they don't (to avoid the confusion with Unicode strings, to stress the difference between the text type and binary data). Though they do call bytes and related types a string: `typing.ByteString`. A byte string is a byte string whatever the Python version is (e.g., common string operations from Python 2 are there and printf-like formatting is back). It is fully in line with [the notion of string used in programming](https://en.wikipedia.org/wiki/String_%28computer_science%29). – jfs Jul 01 '16 at 21:51

3 Answers

3

For example (in the Python 2 interactive interpreter; the output differs in a GUI shell):

>>> s = '你好'
>>> s
'\xe4\xbd\xa0\xe5\xa5\xbd'
>>> us = u'你好'
>>> us
u'\u4f60\u597d'
>>> print type(s)
<type 'str'>
>>> print type(us)
<type 'unicode'>
>>> len(s)
6
>>> len(us)
2

In short:
First, in Python 2 a str object is a sequence of bytes, while a Unicode string is a sequence of code points, which are numbers from 0 to 0x10FFFF.
Then, len(s) returns the number of bytes, while len(us) returns the number of code points. To be stored in memory or written out, the sequence of code points needs to be represented as a sequence of bytes (meaning, values from 0 to 255). The rules for translating a Unicode string into a sequence of bytes are called an encoding.
If you want to get a bytestring from the user, I think you should use raw_input() instead of input().
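The relationship between code points, bytes and encodings can be checked directly. The following is only a minimal sketch, not part of the original answer: the variable names are made up for illustration, and the two characters are written with \u escapes so that the result does not depend on how the source file or terminal is encoded. It should behave the same on Python 2 and Python 3.

```python
# The two characters from the session above, written as escapes.
us = u'\u4f60\u597d'

# Encoding turns code points into bytes; the byte count depends on the encoding.
utf8_bytes = us.encode('utf-8')
utf16_bytes = us.encode('utf-16-le')

code_point_count = len(us)            # counts code points
utf8_byte_count = len(utf8_bytes)     # counts bytes in the UTF-8 form
utf16_byte_count = len(utf16_bytes)   # counts bytes in the UTF-16 form

# Decoding with the same encoding restores the original text.
round_trip = utf8_bytes.decode('utf-8')
```

The same text has different byte lengths under different encodings, which is why a byte count says little about the number of characters.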

Lily
2

In Python 2.6+ you can use from __future__ import unicode_literals, but that only makes string literals Unicode. All functions that returned byte strings still return byte strings.

Example:

>>> s = 'abc'
>>> type(s)
<type 'str'>
>>> from __future__ import unicode_literals
>>> s = 'abc'
>>> type(s)
<type 'unicode'>

For the behavior you want, use Python 3.

Mark Tolonen
2

But is it possible to instruct Python to use unicode strings instead of regular ones whenever a string object is created?

There are two types of strings in Python (on both Python 2 and 3): a bytestring (a sequence of bytes) and a Unicode string (a sequence of Unicode codepoints).

bytestring = b'abc'
unicode_text = u'abc'

The type of string created by the 'abc' string literal depends on the Python version and on the presence of the from __future__ import unicode_literals import. On Python 2 without the import, the 'abc' literal creates a bytestring; otherwise it creates a Unicode string.
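As a rough illustration of how the three kinds of literal behave, here is a small sketch that runs on both Python 2 and Python 3. The names text_type and plain_is_unicode are invented for this example, and the sketch assumes no unicode_literals import is in effect.

```python
import sys

bytestring = b'abc'      # a bytestring on both Python 2 and 3
unicode_text = u'abc'    # a Unicode string on Python 2 and on Python 3.3+
plain = 'abc'            # depends on the Python version (and unicode_literals)

# The Unicode string type: unicode on Python 2, str on Python 3.
text_type = type(u'')

# True on Python 3; False on Python 2 without unicode_literals.
plain_is_unicode = isinstance(plain, text_type)
```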

Add the encoding declaration at the top of your Python source file if you use non-ascii characters in string literals e.g.: # -*- coding: utf-8 -*-.

So when I get a text input, I don't need to use unicode()?

If by "text input" you mean that your program receives bytes somehow (from a file, network, from the command-line) then no: you shouldn't rely on Python to convert bytes to Unicode implicitly -- you should do it explicitly as soon as you receive the bytes using unicode_text = bytestring.decode(character_encoding).

And in reverse, keep the text as Unicode inside your program. Convert Unicode strings to bytes as late as possible when it is necessary (e.g., to send the text via the network).
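The "decode early, encode late" advice can be sketched as follows. The function shout() and the sample data are hypothetical and exist only to show where the conversions sit; the UTF-8 encoding is an assumption that a real program would have to confirm for its own input.

```python
def shout(raw_bytes, encoding='utf-8'):
    """Hypothetical helper: decode on input, work on Unicode, encode on output."""
    text = raw_bytes.decode(encoding)   # bytes -> Unicode as soon as data arrives
    text = text.strip().upper()         # all processing happens on Unicode text
    return text.encode(encoding)        # Unicode -> bytes only when data leaves


incoming = b'  caf\xc3\xa9\n'           # UTF-8 bytes for "  café" plus a newline
outgoing = shout(incoming)
```

Because the uppercasing is done on Unicode text, the accented letter is handled as one character rather than as two unrelated bytes.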

Use bytestrings to work with binary data: an image, compressed content, etc. Use Unicode strings to work with text in Python.

To read Unicode from a file, use io.open() (you have to know the correct character encoding if it is not locale.getpreferredencoding(False)).
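A minimal sketch of io.open() with an explicit encoding, assuming the file is known to be UTF-8. The temporary file is a scratch file created just for this demonstration, and the variable names are illustrative.

```python
import io
import os
import tempfile

text = u'\u4f60\u597d'

# Scratch file used only for this demonstration.
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
try:
    # Text mode with an explicit encoding: write() accepts Unicode text.
    with io.open(path, 'w', encoding='utf-8') as f:
        f.write(text)
    # Reading in text mode returns Unicode text, decoded for us.
    with io.open(path, 'r', encoding='utf-8') as f:
        loaded = f.read()
    # Reading in binary mode returns the raw bytes stored on disk.
    with io.open(path, 'rb') as f:
        raw = f.read()
finally:
    os.remove(path)
```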

Which character encoding to use when you receive text via the network may depend on the corresponding protocol, e.g., the charset can be specified in the Content-Type HTTP header:

    text = data.decode(response.headers.getparam('charset'))
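The getparam() call above is the Python 2 header API. As a sketch of the same idea with a fallback, the example below assumes the headers object behaves like email.message.Message, which is what urllib responses provide on Python 3. The helper decode_body(), the default of UTF-8 and the sample headers are assumptions made for illustration, not something the protocol guarantees.

```python
from email.message import Message


def decode_body(body, headers, default='utf-8'):
    """Hypothetical helper: decode body using the charset named in Content-Type.

    Falls back to `default` when the header carries no charset parameter.
    """
    charset = headers.get_content_charset(default)
    return body.decode(charset)


# Stand-in for response.headers with an explicit charset.
headers = Message()
headers['Content-Type'] = 'text/html; charset=ISO-8859-1'
text = decode_body(b'caf\xe9', headers)

# Stand-in for a response that does not declare a charset.
no_charset = Message()
no_charset['Content-Type'] = 'text/plain'
fallback_text = decode_body(b'caf\xc3\xa9', no_charset)
```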

You could use universal_newlines=True or io.TextIOWrapper() to get Unicode text from an external process started using the subprocess module. It can be non-trivial to figure out what character encoding should be used on Windows (if you read Russian, see the gory details here: Byte при печати вывода внешней команды, i.e. "Byte when printing the output of an external command").
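A small sketch of the universal_newlines=True option, using the current Python interpreter as the child process so that the example is self-contained. On Python 3 the option makes subprocess decode the output with locale.getpreferredencoding(False); on Python 2 it only translates newlines and still returns a bytestring. Instead of io.TextIOWrapper(), the second half shows an explicit decode, which assumes you know the child's output encoding (ASCII here).

```python
import subprocess
import sys

# Child process: this same interpreter printing one ASCII line.
cmd = [sys.executable, '-c', 'print("hello")']

# Ask subprocess for text; on Python 3 the result is a Unicode string.
text_output = subprocess.check_output(cmd, universal_newlines=True)

# Alternative: take the raw bytes and decode them explicitly.
raw_output = subprocess.check_output(cmd)
decoded_output = raw_output.decode('ascii')
```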

Community
jfs