17

I already know r"string", which in Python 2.7 is often used for regex patterns. I have also seen u"string" for, I think, Unicode strings. Now with Python 3 we see b"string".

I have searched for these in different sources / questions, such as What does a b prefix before a python string mean?, but it's difficult to see the big picture of all these strings with prefixes in Python, especially with Python 2 vs 3.

Question: would you have a rule of thumb to remember the different types of strings with prefixes in Python? (or maybe a table with a column for Python 2 and one for Python 3?)

NB: I have read a few questions+answers but I haven't found an easy to remember comparison with all prefixes / Python 2+3

Basj
  • 3
    Euhm, `r"string"` doesn't mean regex, does it? It stands for "raw string". `u"string"` stands for unicode string and `b"string"` stands for bytes "string". It's in the documentation. A rule of thumb is: the letter equals the name of the representation, u for unicode, b for bytes and r for raw. You get used to it after using them more than twice, I'd say. And there is a difference between python2/python3. – Torxed Feb 05 '19 at 11:44
  • @Torxed Yes, but these are different for Python 2 and 3, etc. and having an easy-to-refer-to comparison table would be super useful. – Basj Feb 05 '19 at 11:45
  • @Torxed We often have `u"string"` with Python 2.7, and `"string"` in Python 3 when running the same code, so it would make sense to have a definitive comparison table showing the different cases and the different versions 2 vs 3. It would be much easier than reading the Py2 doc, reading the Py3 doc, then reading a few questions+answers + figure out where are exactly the differences between 2 and 3. – Basj Feb 05 '19 at 11:47
  • There's too much to go through in order to make a complete list of differences. And deciding what's important in a cheat sheet and what is not is up to every person. Create one for yourself, written in a way that you'll understand and find useful. They already [go through](https://wiki.python.org/moin/Python2orPython3) the changes and why they were made. There are hundreds of articles describing the changes on their forums/mailing lists and documentation site. Every different aspect is written in the docs on every function/definition so people know what "the old syntax" was. – Torxed Feb 05 '19 at 11:49

4 Answers

16

From the python docs for literals: https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals

Bytes literals are always prefixed with 'b' or 'B'; they produce an instance of the bytes type instead of the str type. They may only contain ASCII characters; bytes with a numeric value of 128 or greater must be expressed with escapes.

Both string and bytes literals may optionally be prefixed with a letter 'r' or 'R'; such strings are called raw strings and treat backslashes as literal characters. As a result, in string literals, '\U' and '\u' escapes in raw strings are not treated specially. Given that Python 2.x’s raw unicode literals behave differently than Python 3.x’s the 'ur' syntax is not supported.

and

A string literal with 'f' or 'F' in its prefix is a formatted string literal; see Formatted string literals. The 'f' may be combined with 'r', but not with 'b' or 'u', therefore raw formatted strings are possible, but formatted bytes literals are not.

So:

  • r means raw
  • b means bytes
  • u means unicode
  • f means format

The r and b prefixes were already available in Python 2 (b since Python 2.6), as similar prefixes are in many other languages (they are very handy sometimes).

Since string literals were not unicode in Python 2, u-strings were created to support internationalization. As of Python 3, string literals are unicode by default, so "..." is semantically the same as u"...".

Finally, of these, the f-string is the only one that isn't supported in Python 2.
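As a quick check of the four prefixes listed above, here is a minimal Python 3 sketch (it needs 3.6+ because of the f-string). The variable names are only illustrative and are not taken from the docs.

```python
# Python 3.6+ only: each prefix and the kind of object it produces.
raw = r"a\tb"        # backslash kept literally: a, \, t, b
plain = "a\tb"       # \t is turned into a single tab character
data = b"abc"        # a bytes object, not a str
text = u"abc"        # accepted since Python 3.3, same as "abc"
name = "world"
greeting = f"hello {name}"   # expression in braces is substituted

print(len(raw), len(plain))                  # 4 3
print(type(data).__name__)                   # bytes
print(text == "abc", type(text).__name__)    # True str
print(greeting)                              # hello world
```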

Vitor SRG
  • Thanks! Can you just add which ones are available on 2, and which ones are available on 3? It could also be interesting to add the default for both i.e. what is `"string"` equivalent to in 2 vs 3? – Basj Feb 05 '19 at 11:53
13
  1. u-strings are for unicode in python 2. Most probably you can forget about them if you're working with modern applications: default strings in python 3 are all unicode, and if you're migrating from python 2, you'll most probably use from __future__ import unicode_literals, which gives [almost] the same behaviour in python 2.

  2. b-strings are for raw bytes: they have no notion of text, they are just a stream of bytes. They are rarely typed into your source; most often they are the result of network or low-level code, e.g. reading data in binary format, unpacking archives, working with encryption libraries.

    Converting between a b-string and a str is done via:

    # python 3
    >>> 'hēllö'.encode('utf-8')
    b'h\xc4\x93ll\xc3\xb6'
    
    >>> b'h\xc4\x93ll\xc3\xb6'.decode()
    'hēllö'
    
    # python 2 without __future__
    >>> u'hēllö'.encode('utf-8')
    'h\xc4\x93ll\xc3\xb6'
    
    >>> 'h\xc4\x93ll\xc3\xb6'.decode('utf-8')
    u'h\u0113ll\xf6'  # this is correct representation
    
  3. r-strings are not specifically for regex; they are "raw" strings. Unlike regular string literals, an r-string gives no special meaning to escape sequences. E.g. the normal string 'abc\n' is 4 characters long, and its last character is the special "newline" character, which we write in the literal by escaping with \. The raw string r'abc\n' is 5 characters long, and its last two characters are literally \ and n. Two places where you often see raw strings:

  • regex patterns, so that regex escapes do not clash with Python's own escape sequences

  • file paths on Windows systems: as the Windows family uses \ as delimiter, normal string literals look like 'C:\\dir\\file' or '\\\\share\\dir', while raw literals are nicer: r'C:\dir\file' and r'\\share\dir' respectively

  4. One more notable kind is f-strings, which arrived with python 3.6 as a simple and powerful way of formatting strings:
  • f'a equals {a} and b is {b}' will substitute variables a and b at runtime.
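The raw-string and f-string points above can be checked with a short Python 3.6+ sketch. The sample text and the variable names are made up for illustration.

```python
import re

# Raw vs. regular literal: same characters typed, different string produced.
print(len('abc\n'))     # 4, because \n is one newline character
print(len(r'abc\n'))    # 5, because backslash and n are kept separately

# Windows-style path: both spellings produce the same string value.
escaped_path = 'C:\\dir\\file'
raw_path = r'C:\dir\file'
print(escaped_path == raw_path)   # True

# Regex: the raw literal hands the backslash to the regex engine untouched.
digits = re.findall(r'\d+', 'a1b22c333')
print(digits)                     # ['1', '22', '333']

# f-string: the expressions in braces are evaluated when the line runs.
a, b = 1, 2
message = f'a equals {a} and b is {b}'
print(message)                    # a equals 1 and b is 2
```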
Slam
  • Thanks! So `u"hello"` in Py2.7 is strictly equivalent to `"hello"` in Py3? – Basj Feb 05 '19 at 11:55
  • 2
    Yes. And even in python 2, you can do `from __future__ import unicode_literals`, and all `'hello'` in that module will be equivalent to `u'hello'`. Tho, "strictly" has a vague meaning. The type name is different and some methods may differ, but we're comparing different major versions of the language, so... mostly they're interchangeable – Slam Feb 05 '19 at 11:56
4

There are really only two types of string (or string-like object) in Python.

The first is 'Unicode' strings, which are a sequence of characters. The second is bytes (or 'bytestrings'), which are a sequence of bytes.

The first is a series of letter characters found in the Unicode specification. The second is a series of integers between 0 and 255 that are usually rendered to text using some assumed encoding such as ASCII or UTF-8 (which is a specification for encoding Unicode characters in a bytestream).

In Python 2, the default "my string" is a bytestring. The prefix 'u' indicates a 'Unicode' string, e.g. u"my string".

In Python 3, 'Unicode' strings became the default, and thus "my string" is equivalent to u"my string" (the 'u' prefix is not accepted in Python 3.0 to 3.2). To get the old Python 2 bytestrings, you use the prefix 'b', e.g. b"my string".
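To make the characters-versus-bytes distinction concrete, here is a small Python 3 sketch. The sample word and the choice of UTF-8 are just assumptions for the example.

```python
# Python 3: a str is a sequence of characters, bytes is a sequence of integers.
text = "hēllö"
data = text.encode("utf-8")   # the same text, encoded as UTF-8 bytes

print(len(text), len(data))   # 5 7 (two of the characters need two bytes each)
print(text[1])                # ē
print(data[1])                # 196, an integer between 0 and 255
print(type(text).__name__, type(data).__name__)   # str bytes
print(u"my string" == "my string")                # True
print(data.decode("utf-8") == text)               # True
```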

There are two further prefixes, but they do not affect the type of string object, just the way it is interpreted. The first is 'raw' strings which do not interpret escape characters such as \n or \t. For example, the raw string r"my_string\n" contains the literal backslash and 'n' character, while "my_string\n" contains a linebreak at the end of the line.

The second was introduced in Python 3.6: formatted strings with the prefix 'f'. In these, curly braces enclose expressions to be evaluated. For example, the string in:

my_object = 'avocado'
f"my {0.5 + 1.0, my_object} string"

will be evaluated to "my (1.5, 'avocado') string" (where the comma created a tuple, whose elements are shown with their repr). This evaluation happens when the line is executed; the result is an ordinary string with nothing special about it afterwards.
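A small sketch of when that evaluation happens, for Python 3.6+. The describe function is a hypothetical helper added only for this illustration.

```python
my_object = 'avocado'

def describe():
    # The braces are evaluated each time this line runs, with current values.
    return f"my {0.5 + 1.0, my_object} string"

first = describe()
my_object = 'banana'
second = describe()

print(first)                  # my (1.5, 'avocado') string
print(second)                 # my (1.5, 'banana') string
print(type(first).__name__)   # str, a plain string once built
```

Note that first does not change when my_object is rebound: the f-string produced a finished str at the moment it was evaluated.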

And finally, you can use the multiline string notation:

"""this is my
multiline
string"""

with 'r' or 'f' specifiers as you wish.

In Python 2, if you have used no prefix or only an 'r' prefix, it is a bytestring, and if you have used a 'u' prefix it is a Unicode string.

In Python 3, if you have used no prefix, or 'u', 'r', 'f' or the combination 'rf', it is a Unicode string ('u' cannot be combined with 'r' or 'f'). If you have used a 'b' prefix (alone or together with 'r') it is a bytestring. Using both 'b' and 'u' is obviously not allowed.
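If you want to check these rules on your own interpreter, one possible way is to compile each literal and see what comes out. This is only a sketch for Python 3.6+; literal_type is a hypothetical helper name, not something from the standard library.

```python
def literal_type(source):
    """Return the type name a literal evaluates to, or None on a syntax error."""
    try:
        code = compile(source, "<prefix-check>", "eval")
    except SyntaxError:
        return None
    return type(eval(code)).__name__

# Try a few prefixes and prefix combinations on the literal "x".
for prefix in ["", "u", "r", "f", "rf", "b", "rb", "br", "ur", "ub", "fb"]:
    print(repr(prefix), literal_type(prefix + '"x"'))
```

On Python 3.6 or later this should report str for the first five prefixes, bytes for 'b', 'rb' and 'br', and None (a syntax error) for 'ur', 'ub' and 'fb'.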

Andrew McLeod
  • Thank you for this answer! So as a rule of thumb, Py2's `u"hello"` = Py3's `"hello"` and Py2's `"hello"` = Py3's `b"hello"`, right? – Basj Feb 05 '19 at 12:01
  • Yes. The 'u' prefix is not supported in Python 3.0 to 3.2, but was re-introduced in Python 3.3. The 'b' prefix is also supported in Python 2.6 and 2.7 (but is ignored) - it is included to allow compatibility with Python 3. The 'f' prefix was only introduced in Python 3.6 and likely to be further improved to remove some limitations. – Andrew McLeod Feb 05 '19 at 12:06
2

This is what I observed (seems confirmed by other answers):

Python 2                       Python 3
-----------------------------------------------
"hello"                        b"hello"
b"hello"            <=>
<type 'str'>                   <class 'bytes'>
-----------------------------------------------
u"hello"            <=>        "hello"
                               u"hello"
<type 'unicode'>               <class 'str'>
Basj