375

I am working with a library which returns a "byte string" (bytes) and I need to convert this to a string.

Is there actually a difference between those two things? How are they related, and how can I do the conversion?

Karl Knechtel
Sheldon
  • 1
    See also [What does the 'b' character do in front of a string literal?](https://stackoverflow.com/q/6269765/774575) – mins Jun 01 '21 at 08:47

9 Answers

715

The only thing that a computer can store is bytes.

To store anything in a computer, you must first encode it, i.e. convert it to bytes. For example:

  • If you want to store music, you must first encode it using MP3, WAV, etc.
  • If you want to store a picture, you must first encode it using PNG, JPEG, etc.
  • If you want to store text, you must first encode it using ASCII, UTF-8, etc.

MP3, WAV, PNG, JPEG, ASCII and UTF-8 are examples of encodings. An encoding is a format to represent audio, images, text, etc. in bytes.

In Python, a byte string is just that: a sequence of bytes. It isn't human-readable. Under the hood, everything must be converted to a byte string before it can be stored in a computer.

On the other hand, a character string, often just called a "string", is a sequence of characters. It is human-readable. A character string can't be directly stored in a computer; it has to be encoded first (converted into a byte string). There are multiple encodings through which a character string can be converted into a byte string, such as ASCII and UTF-8.

'I am a string'.encode('ASCII')

The above Python code will encode the string 'I am a string' using the ASCII encoding. The result will be a byte string. If you print it (in Python 3), Python will represent it as b'I am a string'. Remember, however, that byte strings aren't human-readable; Python merely displays them as ASCII when you print them. In Python, a byte string is represented by a b prefix, followed by the byte string's ASCII representation.
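As a rough illustration of that display rule (a sketch assuming Python 3; the variable names are made up for this example): bytes in the printable ASCII range are shown as characters, and anything else is shown as a \x escape.

```python
# 'caf\u00e9' is the word "cafe" with an acute accent on the e.
data = 'caf\u00e9'.encode('utf-8')

# repr() is what print() and the interactive prompt show for bytes:
# printable ASCII bytes appear as characters, the rest as \x escapes.
shown = repr(data)
print(shown)

# Underneath, a bytes object is a sequence of integers in range(256).
values = list(data)
print(values)
```

The accented character takes two bytes in UTF-8, which is why the list has five integers for a four-character string.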

A byte string can be decoded back into a character string, if you know the encoding that was used to encode it.

b'I am a string'.decode('ASCII')

The above code will return the original string 'I am a string'.

Encoding and decoding are inverse operations. Everything must be encoded before it can be written to disk, and it must be decoded before it can be read by a human.
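A minimal sketch of the round trip, assuming Python 3. It also shows what happens when the chosen encoding cannot represent a character; the variable names are illustrative only.

```python
text = 'I am a string'

# Encoding and then decoding with the same encoding returns the original text.
encoded = text.encode('ascii')
decoded = encoded.decode('ascii')
print(decoded == text)

# ASCII cannot represent every character, so encoding such text fails.
try:
    'na\u00efve'.encode('ascii')  # contains a non-ASCII i-diaeresis
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)
```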

Peter Mortensen
Zenadix
  • 111
    Zenadix deserves some kudos here. After some years functioning in this environment, his is the first explanation that clicked with me. I may tattoo it on my other arm (one arm already has "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky" – neil.millikin Jul 16 '15 at 12:06
  • 7
    Absolutely brilliant. Lucid and easy to understand. However, I would like to mention that this line - "If you print it, Python will represent it as b'I am a string'" is true for Python3 as for Python2 bytes and str are the same thing. – SRC Dec 17 '16 at 09:11
  • 12
    I am awarding you this bounty for offering a very human-readable explanation to put some clarity in this subject! – fedorqui Jan 08 '17 at 15:08
  • 5
    Great answer. The only thing that could perhaps be added is to point out more clearly that historically, programmers and programming languages have tended to explicitly or implicitly *assume that a byte sequence and an ASCII string were the same thing*. Python 3 decided to explicitly break this assumption, correctly IMHO. – nekomatic Jan 17 '17 at 09:39
  • 2
    IMHO, Python3 should've opted to print bytes as hexa values as a default behaviour with some easy function to convert to ascii or print in ascii. – HFSDev Mar 07 '17 at 18:56
  • 6
    Link to Joel's post mentioned by @neil.millikin above : https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/ – Kshitij Saraogi May 07 '17 at 15:54
  • I really like this explanation. However, I think it doesn't correctly explain some behavior in python (2.7). For example, using `os.urandom(32)` creates a string (the `repr` of the returned bytes). To "decode" (using the meaning in this post) to a `base64` string, one actually does `encode('base64')`. This is strange and is directly counter to what this post describes. – Dan Kowalczyk Oct 02 '17 at 03:56
  • superb explanation, before this i had some confusion but now clear. thanks zenadix – Athar Apr 21 '18 at 02:48
  • 1
    Give this man a cookie! No disrespect, thank you very much for your detailed explanation. – user3296487 May 16 '18 at 20:13
  • In case of strings everything is clear. We just encode some abstraction to bytes `'I am a string'.encode('ASCII')`. But what about image? Image is image and it already stored on disk. So what we encoding in case of image? – lalilulelo_1986 Apr 02 '20 at 19:53
  • One part I think might confuse some people: "A character string can't be directly stored in a computer...." At least to me, 'character string' is a term that _means_ 'human symbols next to each other in a computer'. So, all character strings must be bytes 'stored' in a computer, and they must have an implicit encoding, the computer couldn't display them (encoding) in, e.g., an editor. I think implying otherwise is confusing. In my mind, Python's `bytes` is just a way for programmers to be explicit about character encoding. – Hawkeye Parker Dec 29 '20 at 22:10
  • That is exactly what I was looking for. Is there a way then to print the byte string into bytes? b'hello' => 68 65 6c 6c 6f – zarathoustra Nov 22 '21 at 08:48
  • "a character string, often just called a "string", is a sequence of characters. It is human-readable. A character string can't be directly stored in a computer" - What does this even mean? Are not the character string already stored in the computer? Sure, They are already there present as bytes and based upon the implicit encoding scheme they are presented in stdout. And that is precisely what is happening in the case of byte string as well. So.. What is the difference? – figs_and_nuts Jan 15 '22 at 07:43
  • @zarathoustra - ```list(b'hello') = [104, 101, 108, 108, 111]``` – figs_and_nuts Jan 15 '22 at 07:51
  • This is the most clear explanation imho. However, I don't follow why `b'hi'.decode()`, `b'hi'.decode('utf8')` and `b'hi'.decode('ascii')` gives the same output. – Andrew Anderson Jan 27 '22 at 11:55
361

Assuming Python 3 (in Python 2, this difference is a little less well-defined): a string is a sequence of characters, i.e. Unicode code points; these are an abstract concept and can't be directly stored on disk. A byte string is a sequence of, unsurprisingly, bytes: things that can be stored on disk. The mapping between them is an encoding. There are quite a lot of these (and infinitely many are possible), and you need to know which applies in the particular case in order to do the conversion, since a different encoding may map the same bytes to a different string:

>>> b'\xcf\x84o\xcf\x81\xce\xbdo\xcf\x82'.decode('utf-16')
'蓏콯캁澽苏'
>>> b'\xcf\x84o\xcf\x81\xce\xbdo\xcf\x82'.decode('utf-8')
'τoρνoς'

Once you know which one to use, you can use the .decode() method of the byte string to get the right character string from it as above. For completeness, the .encode() method of a character string goes the opposite way:

>>> 'τoρνoς'.encode('utf-8')
b'\xcf\x84o\xcf\x81\xce\xbdo\xcf\x82'
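If you are not sure of the encoding, decoding can fail outright rather than just produce the wrong text. Here is a hedged sketch of one way to cope, using the standard `errors` argument of `bytes.decode`; the helper name `decode_or_replace` is hypothetical and not part of any library.

```python
raw = b'\xcf\x84o\xcf\x81\xce\xbdo\xcf\x82'

def decode_or_replace(data, encoding):
    """Hypothetical helper: decode strictly, else fall back to replacement characters."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        # errors='replace' substitutes U+FFFD for bytes that cannot be decoded.
        return data.decode(encoding, errors='replace')

as_utf8 = decode_or_replace(raw, 'utf-8')    # the correct encoding for these bytes
as_ascii = decode_or_replace(raw, 'ascii')   # the wrong encoding: non-ASCII bytes are replaced
```

Replacement characters lose information, so this is only suitable for display or logging, not for data you need to round-trip.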
lvc
  • 8
    To clarify for Python 2 users: the `str` type is the same as the `bytes` type; this answer is equivalently comparing the `unicode` type (does not exist in Python 3) to the `str` type. – craymichael Nov 10 '16 at 16:02
  • To be technically correct, unicode is not the default encoding, rather the `utf-8` encoding is the default character encoding to store unicode strings in memory. – Kshitij Saraogi May 10 '17 at 09:03
  • 4
    @KshitijSaraogi that isn't quite true either; that whole sentence was edited in and is a bit unfortunate. The in-memory representation of Python 3 `str` objects is not accessible or relevant from the Python side; the data structure is just a sequence of codepoints. Under [PEP 393](https://www.python.org/dev/peps/pep-0393/), the exact internal encoding is one of Latin-1, UCS2 or UCS4, and a utf-8 representation may be cached after it is first requested, but even C code is discouraged from relying on these internal details. – lvc May 10 '17 at 09:46
  • 3
    If they can't be directly stored on disk, so how are they stored in memory? – z33k Nov 04 '17 at 14:38
  • 2
    @orety they do have to be encoded *somehow* internally for exactly that reason, but this isn't exposed to you from Python code, much like you don't have to care about how floating point numbers are stored. – lvc Nov 05 '17 at 22:43
  • What is the default encoding in that case, i.e. the encoding used when reading lines from a file into a string? – HelloGoodbye Aug 01 '19 at 22:47
  • 1
    "these are an abstract concept" I disagree with this - it's not abstract at all. It exists in some form within the memory of the program. – Chris Stryczynski Feb 15 '20 at 16:10
  • 4
    @ChrisStryczynski see the comments above - sure they're stored in memory *somehow*, but that form is explicitly abstracted away. Indeed, these days, it can change during the lifetime of a program and be different between different strings or might even be more than one (some encodings are cached), depending on the characters in them - but the only time you need to worry about that is if you're hacking on the implementation of the string type itself. – lvc Feb 16 '20 at 07:56
  • 1
    I agree with @ChrisStryczynski. I understand the distinction you're making, but to imply that somehow a character string isn't bytes and doesn't have an encoding is confusing, at least to me. To be 'in a computer', a string must be bytes, and for anyone to read it, it must have _some_ character encoding. This is meaningful if, e.g., you try to print a Chinese UTF-8 string in a terminal, but get '??????'. To me, understanding this helps to clarify what strings and encodings are. In this sense, `byte` is just a way for a programmer to be explicit about a character encoding, for whatever reason. – Hawkeye Parker Dec 29 '20 at 22:06
  • I think that "abstract" is not the right word for it. In the same way a running python program would not commonly be referred to as an "abstract turing machine". Possibly just the implementation could be varied or hidden from the user. – Chris Stryczynski Dec 29 '20 at 23:21
  • @HelloGoodbye You may specify an `encoding` parameter for the `open` call; and you may see the documentation to understand the default value for that parameter. The file contents may use *any* encoding; it is your responsibility to know (or find out somehow) which encoding was used, and specify it. This is not in any way a Python-specific issue. Files can only contain raw data as a sequence of bytes. – Karl Knechtel Sep 03 '22 at 00:22
  • @ChrisStryczynski it is abstracted in the OOP sense: you are not supposed to know or care about what in-memory representation is used. – Karl Knechtel Sep 03 '22 at 00:24
32

Note: I will elaborate more on my answer for Python 3, since the end of life of Python 2 is very close.

In Python 3

bytes consists of sequences of 8-bit unsigned values, while str consists of sequences of Unicode code points that represent textual characters from human languages.

>>> # bytes
>>> b = b'h\x65llo'
>>> type(b)
<class 'bytes'>
>>> list(b)
[104, 101, 108, 108, 111]
>>> print(b)
b'hello'
>>>
>>> # str
>>> s = 'nai\u0308ve'
>>> type(s)
<class 'str'>
>>> list(s)
['n', 'a', 'i', '̈', 'v', 'e']
>>> print(s)
naïve

Even though bytes and str seem to work the same way, their instances are not compatible with each other, i.e., bytes and str instances can't be used together with operators like > and +. In addition, keep in mind that comparing bytes and str instances for equality, i.e. using ==, will always evaluate to False, even when they contain exactly the same characters.

>>> # concatenation
>>> b'hi' + b'bye' # this is possible
b'hibye'
>>> 'hi' + 'bye' # this is also possible
'hibye'
>>> b'hi' + 'bye' # this will fail
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: can't concat str to bytes
>>> 'hi' + b'bye' # this will also fail
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: can only concatenate str (not "bytes") to str
>>>
>>> # comparison
>>> b'red' > b'blue' # this is possible
True
>>> 'red'> 'blue' # this is also possible
True
>>> b'red' > 'blue' # you can't compare bytes with str
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '>' not supported between instances of 'bytes' and 'str'
>>> 'red' > b'blue' # you can't compare str with bytes
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '>' not supported between instances of 'str' and 'bytes'
>>> b'blue' == 'red' # equality between str and bytes always evaluates to False
False
>>> b'blue' == 'blue' # equality between str and bytes always evaluates to False
False
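The usual way around these errors is to convert explicitly so that both operands have the same type. A small sketch, assuming the bytes are known to be ASCII (the names are illustrative only):

```python
label = 'size: '
payload = b'42'

# Decode the bytes to get a str result...
as_text = label + payload.decode('ascii')

# ...or encode the str to get a bytes result.
as_bytes = label.encode('ascii') + payload

# Equality is only meaningful once both sides have the same type.
same = payload.decode('ascii') == '42'
```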

Another issue when dealing with bytes and str arises when working with files opened with the open built-in function. On the one hand, if you want to read or write binary data from/to a file, always open the file in a binary mode such as 'rb' or 'wb'. On the other hand, if you want to read or write Unicode data from/to a file, be aware of your system's default encoding, and pass the encoding parameter when necessary to avoid surprises.
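A self-contained sketch of both modes, assuming Python 3. The file name and the use of a temporary directory are only there to keep the example runnable.

```python
import os
import tempfile

text = 'na\u00efve'  # "naive" with a precomposed i-diaeresis

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'example.txt')

    # Text mode: pass encoding explicitly instead of relying on the default.
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)

    # Binary mode returns the raw bytes that were written.
    with open(path, 'rb') as f:
        raw = f.read()

    # Text mode with the matching encoding returns the original str.
    with open(path, 'r', encoding='utf-8') as f:
        round_trip = f.read()
```

Here raw is a bytes object one byte longer than the text has characters, because the accented letter takes two bytes in UTF-8.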

In Python 2

str consists of sequences of 8-bit values, while unicode consists of sequences of Unicode characters. One thing to keep in mind is that str and unicode can be used together with operators if the str consists only of 7-bit ASCII characters.

It might be useful to use helper functions to convert between str and unicode in Python 2, and between bytes and str in Python 3.
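For Python 3, such helpers might look like the sketch below. The names `to_str` and `to_bytes` and the UTF-8 default are assumptions made for illustration; pick whatever encoding your data actually uses.

```python
def to_str(value, encoding='utf-8'):
    """Return value as str, decoding it first if it is bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def to_bytes(value, encoding='utf-8'):
    """Return value as bytes, encoding it first if it is str."""
    if isinstance(value, str):
        return value.encode(encoding)
    return value
```

Calling these at the boundaries of a program (reading input, writing output) lets the rest of the code work with a single type.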

lmiguelvargasf
7

Let's have a simple one-character string 'š' and encode it into a sequence of bytes:

>>> 'š'.encode('utf-8')
b'\xc5\xa1'

For the purpose of this example, let's display the sequence of bytes in its binary form:

>>> bin(int(b'\xc5\xa1'.hex(), 16))
'0b1100010110100001'
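An alternative way to get the same view, one byte at a time, is sketched below. It relies only on the built-in `format` function and `bytes.hex`; the variable names are illustrative.

```python
data = '\u0161'.encode('utf-8')  # the same one-character string as above

# Iterating over bytes yields integers; format each as 8 binary digits.
bits = [format(byte, '08b') for byte in data]
print(bits)

# bytes.hex() gives the hexadecimal view of the same data.
hex_view = data.hex()
print(hex_view)
```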

Now it is generally not possible to decode the information back without knowing how it was encoded. Only if you know that the UTF-8 text encoding was used can you follow the algorithm for decoding UTF-8 and recover the original string:

11000101 10100001
   ^^^^^   ^^^^^^
   00101   100001

You can display the binary number 101100001 back as a string:

>>> chr(int('101100001', 2))
'š'
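The bit selection shown above can be written out with masks and shifts. This is only a sketch for the two-byte case: the function name is hypothetical, and it does not perform all the checks a real UTF-8 decoder does (for example, it does not reject overlong forms).

```python
def decode_two_byte_utf8(data):
    """Hypothetical helper: decode exactly one two-byte UTF-8 sequence."""
    first, second = data  # unpacking bytes yields two integers

    # A two-byte sequence has the bit pattern 110xxxxx 10xxxxxx.
    if (first & 0b11100000) != 0b11000000:
        raise ValueError('first byte is not a two-byte lead byte')
    if (second & 0b11000000) != 0b10000000:
        raise ValueError('second byte is not a continuation byte')

    # Keep the low 5 bits of the first byte and the low 6 bits of the second.
    code_point = ((first & 0b00011111) << 6) | (second & 0b00111111)
    return chr(code_point)


result = decode_two_byte_utf8(b'\xc5\xa1')
print(hex(ord(result)))
```

In practice you would simply call `b'\xc5\xa1'.decode('utf-8')`; the manual version is only meant to make the algorithm visible.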
Peter Mortensen
Jeyekomon
  • Re *"encode it into a sequence of bytes"*: But it must have had some representation before being encoded. What was that representation? [ISO 8859-1](https://en.wikipedia.org/wiki/ISO/IEC_8859-1)? – Peter Mortensen Apr 27 '22 at 23:25
6

From What is Unicode?:

Fundamentally, computers just deal with numbers. They store letters and other characters by assigning a number for each one.

......

Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.

So when a computer represents a string, it looks up each character's unique Unicode number, and these numbers are stored in memory. But you can't directly write the string to disk or transmit it over a network as Unicode numbers, because these numbers are just plain integers. You have to encode the string to a byte string first, for example using UTF-8. UTF-8 is a character encoding capable of encoding all possible characters, and it stores characters as bytes. The encoded string can then be used almost everywhere, because UTF-8 is supported nearly everywhere. When you open a UTF-8 encoded text file from another system, your computer decodes it and displays its characters by their unique Unicode numbers.

When a browser receives string data encoded in UTF-8 from the network, it decodes the data to a string (assuming the browser uses UTF-8) and displays the string.

In Python 3, you can transform string and byte string to each other:

>>> print('中文'.encode('utf-8'))
b'\xe4\xb8\xad\xe6\x96\x87'
>>> print(b'\xe4\xb8\xad\xe6\x96\x87'.decode('utf-8'))
中文
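To make the difference between "Unicode number" and "encoded bytes" concrete, here is a small sketch using the same two characters. `ord` is the built-in that returns a character's code point; the codec names are standard Python codecs.

```python
text = '\u4e2d\u6587'  # the same two characters as in the example above

# ord() gives each character's Unicode code point (its unique number).
code_points = [hex(ord(ch)) for ch in text]
print(code_points)

# The same text takes a different number of bytes in different encodings.
utf8_len = len(text.encode('utf-8'))
utf16_len = len(text.encode('utf-16-le'))
utf32_len = len(text.encode('utf-32-le'))
print(utf8_len, utf16_len, utf32_len)
```

The code points stay the same regardless of encoding; only the byte representation changes.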

In short, a string is for displaying text to humans on a computer, and a byte string is for storing data on disk and transmitting it.

Peter Mortensen
Sam Yang
  • 1
    "*Unicode provides a unique number for every character*": 1/ Unicode (from Unicode Consortium) is not an encoding but a list of glyph names, UTF-8 or UTF-32 (from ISO) are, 'T' in UTF is for 'transformation'. 2/ You likely meant UTF-8, but numbers are not unique. [Wikipedia](https://en.wikipedia.org/wiki/Unicode): "UTF-8, the dominant encoding [...] uses one byte for the first 128 code points, and up to 4 bytes for other characters". To have a unique sequence for all code points, then you need to use UTF-32, which assigns 4 bytes to each code point, but this encoding is not used in practical. – mins Jun 01 '21 at 09:32
3

Unicode is an agreed-upon format for the binary representation of characters and various kinds of formatting (e.g., lower case/upper case, new line, and carriage return), and other "things" (e.g., emojis). A computer is no less capable of storing a Unicode representation (a series of bits), whether in memory or in a file, than it is of storing an ASCII representation (a different series of bits), or any other representation (series of bits).

For communication to take place, the parties to the communication must agree on what representation will be used.

Because Unicode seeks to represent all the possible characters (and other "things") used in inter-human and inter-computer communication, it requires a greater number of bits for the representation of many characters (or things) than other systems of representation that seek to represent a more limited set of characters/things. To "simplify," and perhaps to accommodate historical usage, Unicode representation is almost exclusively converted to some other system of representation (e.g., ASCII) for the purpose of storing characters in files.

It is not the case that Unicode cannot be used for storing characters in files, or transmitting them through any communications channel. It is simply that it is not.

The term "string," is not precisely defined. "String," in its common usage, refers to a set of characters/things. In a computer, those characters may be stored in any one of many different bit-by-bit representations. A "byte string" is a set of characters stored using a representation that uses eight bits (eight bits being referred to as a byte). Since, these days, computers use the Unicode system (characters represented by a variable number of bytes) to store characters in memory, and byte strings (characters represented by single bytes) to store characters to files, a conversion must be used before characters represented in memory will be moved into storage in files.

Peter Mortensen
1

A string is a bunch of items strung together; the term is synonymous with a sequence. A byte string is a sequence of bytes, like b'\xce\xb1\xce\xac', which represents "αά" when decoded as UTF-8. A character string is a bunch of characters, like "αά".

A byte string can be stored directly on disk, while a string (character string) cannot. The mapping between them is an encoding.

Peter Mortensen
ahmed samy
  • Your answer could be improved with additional supporting information. Please [edit] to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Sep 29 '21 at 14:43
0

Put simply, think of natural languages such as English, Bengali, Chinese, etc. When spoken, all of these languages produce sounds. But do we understand all of them just because we can hear them?

The answer is generally no. So, if I say I understand English, it means that I know how those sounds are encoded to some meaningful English words and I just decode these sounds in the same way to understand them. So, the same goes for any other language. If you know it, you have the encoder-decoder pack for that language in your mind, and again if you don't know it, you just don't have this.

The same goes for digital systems. Just as we can only hear sounds with our ears and make sounds with our mouths, computers can only store and read bytes. So a given application knows how to read bytes and interpret them (for example, how many bytes to consider to understand a piece of information) and also how to write them in the same way, so that other applications understand them too. But without that understanding (the encoder-decoder), all data written to a disk are just strings of bytes.

Peter Mortensen
hafiz031
-1

The Python language includes str and bytes as standard "built-in types". In other words, they are both classes. I don't think it's worthwhile trying to rationalize why Python has been implemented this way.

Having said that, str and bytes are very similar to one another. Both share most of the same methods. The following methods are unique to the str class:

casefold
encode
format
format_map
isdecimal
isidentifier
isnumeric
isprintable

The following methods are unique to the bytes class:

decode
fromhex
hex
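You can check lists like these on your own interpreter with a sketch along the following lines. The exact output depends on the Python version, so treat the two lists above as a snapshot; `public_names` is a hypothetical helper written for this example.

```python
def public_names(cls):
    """Return the attribute names of cls that do not start with an underscore."""
    return {name for name in dir(cls) if not name.startswith('_')}


# Set difference gives the methods that exist on one type but not the other.
str_only = sorted(public_names(str) - public_names(bytes))
bytes_only = sorted(public_names(bytes) - public_names(str))

print(str_only)
print(bytes_only)
```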
Peter Mortensen
fiftytwocards
  • 2
    Yes, but this answer is fairly incomplete. Strings are higher level, human readable construction that uses characters as building blocks and can't be saved directly to the disk. Whereas, bytes are lower level construction that can directly be saved. Strings and bytes are mapped with encoding. If you know the encoding, you can decode a byte-string object and convert it into a string object. – Redowan Delowar Sep 03 '20 at 19:01