340

Following this python example, I encode a string as Base64 with:

>>> import base64
>>> encoded = base64.b64encode(b'data to be encoded')
>>> encoded
b'ZGF0YSB0byBiZSBlbmNvZGVk'

But, if I leave out the leading b:

>>> encoded = base64.b64encode('data to be encoded')

I get the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python32\lib\base64.py", line 56, in b64encode
   raise TypeError("expected bytes, not %s" % s.__class__.__name__)
   TypeError: expected bytes, not str

Why is this?

martineau
dublintech

5 Answers

345

base64 encoding takes 8-bit binary byte data and encodes it using only the characters A-Z, a-z, 0-9, +, /* so that it can be transmitted over channels that do not preserve all 8 bits of data, such as email.

Hence, it wants a string of 8-bit bytes. You create those in Python 3 with the b'' syntax.

If you remove the b, it becomes a string. A string is a sequence of Unicode characters. base64 has no idea what to do with Unicode data because it is not 8-bit. It's not really any bits, in fact. :-)

In your second example:

>>> encoded = base64.b64encode('data to be encoded')

All the characters fit neatly into the ASCII character set, and base64 encoding is therefore actually a bit pointless. You can convert it to ascii instead, with

>>> encoded = 'data to be encoded'.encode('ascii')

Or simpler:

>>> encoded = b'data to be encoded'

Which would be the same thing in this case.
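The difference between the two types can be sketched as follows. This is a minimal illustration assuming Python 3; the variable names are only for the example.

```python
import base64

text = 'data to be encoded'    # str: a sequence of Unicode characters
raw = text.encode('ascii')     # bytes: a sequence of 8-bit values

# Indexing a bytes object yields an integer, not a character.
first_byte = raw[0]            # 100, the ASCII code for 'd'

# b64encode accepts the bytes object...
encoded = base64.b64encode(raw)

# ...but rejects the str with a TypeError.
try:
    base64.b64encode(text)
    rejected = False
except TypeError:
    rejected = True
```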


* Most base64 flavours may also include a = at the end as padding. In addition, some base64 variants may use characters other than + and /. See the Variants summary table at Wikipedia for an overview.
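As a small illustration of one such variant, the standard library offers a URL- and filename-safe alphabet that swaps + and / for - and _. The input bytes below are chosen only because they force those two characters to appear.

```python
import base64

# Three bytes whose 6-bit groups map to indexes 62 and 63 of the alphabet.
data = b'\xfb\xff\xfe'

standard = base64.b64encode(data)          # uses '+' and '/'
url_safe = base64.urlsafe_b64encode(data)  # uses '-' and '_' instead

# Both variants decode back to the same bytes.
restored = base64.urlsafe_b64decode(url_safe)
```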

Ry-
Lennart Regebro
  • "it wants a string of 8-bit bytes". A byte in a computer is made of 8 bits and most data types in all programming languages (including a Python str) are made of bytes, so I don't understand what you mean with that. Maybe "it wants a string of 8-bit characters", as an ASCII string? – Alan Evangelista Jul 29 '21 at 11:49
  • 1
    @AlanEvangelista Conceptually, a Python string is a sequence of Unicode characters. It needn't have any particular underlying binary representation. On the other hand, a `bytes` or `bytearray` object actually does represent a sequence of bytes/octets. (Although it needn't have any particular underlying binary representation either.) – user2846495 Aug 23 '21 at 12:57
  • @AlanEvangelista Not every computer has an 8-bit byte. These days it is very unlikely to find a device with a byte other than 8 bits, except maybe for DSPs, but in the old days architectures using 6 or 7 bits per byte were not uncommon at all. The cursed grimoires of the ancients keep forbidden knowledge of 32 and even blasphemous 48-bit bytes. – jetpack_guy Dec 12 '22 at 20:59
229

Short Answer

You need to pass a bytes-like object (bytes, bytearray, etc.) to the base64.b64encode() function. Here are two ways:

>>> import base64
>>> data = base64.b64encode(b'data to be encoded')
>>> print(data)
b'ZGF0YSB0byBiZSBlbmNvZGVk'

Or with a variable:

>>> import base64
>>> string = 'data to be encoded'
>>> data = base64.b64encode(string.encode())
>>> print(data)
b'ZGF0YSB0byBiZSBlbmNvZGVk'

Why?

In Python 3, str objects are not C-style character arrays (so they are not byte arrays); they are sequences of Unicode characters that do not have any inherent byte encoding. You can encode such a string in a variety of ways. The most common (and the default for .encode() in Python 3) is UTF-8, which is backwards compatible with ASCII (as are most widely used encodings). That is what happens when you take a string and call the .encode() method on it: Python encodes the string as UTF-8 (the default encoding) and gives you the corresponding sequence of bytes.
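To make that concrete, here is a small sketch that encodes the same one-character string three ways. The choice of encodings is only illustrative; the point is that each produces different bytes, and so would produce different base64 output.

```python
import base64

text = 'ñ'

utf8_bytes = text.encode()              # no argument: UTF-8 is the default
utf16_bytes = text.encode('utf-16-le')
latin1_bytes = text.encode('latin-1')

# Same text, different bytes, therefore different base64 output.
for name, raw in [('utf-8', utf8_bytes),
                  ('utf-16-le', utf16_bytes),
                  ('latin-1', latin1_bytes)]:
    print(name, raw, base64.b64encode(raw))
```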

Base-64 Encoding in Python 3

Originally the question title asked about Base-64 encoding. Read on for Base-64 stuff.

base64 encoding takes 6-bit binary chunks and encodes them using the characters A-Z, a-z, 0-9, '+', '/', and '=' (some encodings use different characters in place of '+' and '/'). This character encoding is based on the mathematical radix-64 (base-64) number system, but the two are very different. Base-64 in math is a number system like binary or decimal, and you do this change of radix on the entire number, or (if the radix you're converting from is a power of 2 less than 64) in chunks from right to left.

In base64 encoding, the translation is done from left to right; those first 64 characters are why it is called base64 encoding. The 65th symbol, '=', is used for padding, since the encoding pulls 6-bit chunks but the data it is usually meant to encode are 8-bit bytes, so sometimes there are only 2 or 4 bits in the last chunk.
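A quick way to see the padding rule is to encode inputs of increasing length. This sketch only uses base64.b64encode; the dictionary name is made up for the example.

```python
import base64

# Map each input to its encoding to see how '=' padding varies with length.
padded = {raw: base64.b64encode(raw) for raw in (b't', b'te', b'tes', b'test')}

for raw, encoded in padded.items():
    print(len(raw), encoded)
# 1 b'dA=='
# 2 b'dGU='
# 3 b'dGVz'
# 4 b'dGVzdA=='
```

Inputs whose length is a multiple of 3 bytes need no padding, because 24 bits split evenly into four 6-bit chunks.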

Example:

>>> data = b'test'
>>> for byte in data:
...     print(format(byte, '08b'), end=" ")
...
01110100 01100101 01110011 01110100
>>>

If you interpret that binary data as a single integer, then this is how you would convert it to base-10 and base-64 (table for base-64):

base-2:  01 110100 011001 010111 001101 110100 (base-64 grouping shown)
base-10:                            1952805748
base-64:  B      0      Z      X      N      0
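The integer interpretation above can be checked with a short sketch. The helper to_radix64 is hypothetical (written for this example, not part of any library) and simply uses the base64 alphabet as the digits of a base-64 number.

```python
import string

# The standard base64 alphabet, used here as the digits of a base-64 number.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + '+/'

def to_radix64(number):
    """Write a non-negative integer in base 64, most significant digit first."""
    if number == 0:
        return ALPHABET[0]
    digits = []
    while number:
        number, remainder = divmod(number, 64)
        digits.append(ALPHABET[remainder])
    return ''.join(reversed(digits))

value = int.from_bytes(b'test', 'big')   # 1952805748
as_radix64 = to_radix64(value)           # 'B0ZXN0'
```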

base64 encoding, however, will re-group this data thusly:

base-2:  011101  000110  010101 110011 011101 00(0000) <- pad w/zeros to make a clean 6-bit chunk
base-10:     29       6      21     51     29      0
base-64:      d       G       V      z      d      A

So, 'B0ZXN0' is the base-64 version of our binary, mathematically speaking. However, base64 encoding has to do the encoding in the opposite direction (so the raw data is converted to 'dGVzdA') and also has a rule to tell other applications how much space is left off at the end. This is done by padding the end with '=' symbols. So, the base64 encoding of this data is 'dGVzdA==', with two '=' symbols to signify two pairs of bits will need to be removed from the end when this data gets decoded to make it match the original data.

Let's test this to see if I am being dishonest:

>>> encoded = base64.b64encode(data)
>>> print(encoded)
b'dGVzdA=='
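For comparison, the left-to-right regrouping can be written out by hand. This is a teaching sketch only, assuming the standard alphabet; manual_b64encode is a hypothetical helper and much slower than the library function.

```python
import base64
import string

ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + '+/'

def manual_b64encode(raw):
    """Hypothetical teaching helper: base64-encode bytes by hand."""
    # Write every byte as 8 bits, left to right.
    bits = ''.join(format(byte, '08b') for byte in raw)
    # Pad with zero bits on the right so the length is a multiple of 6.
    bits += '0' * (-len(bits) % 6)
    # Translate each 6-bit chunk into one character of the alphabet.
    chars = [ALPHABET[int(bits[i:i + 6], 2)] for i in range(0, len(bits), 6)]
    # Pad with '=' so the output length is a multiple of 4 characters.
    chars.append('=' * (-len(chars) % 4))
    return ''.join(chars).encode('ascii')

result = manual_b64encode(b'test')       # b'dGVzdA=='
```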

Why use base64 encoding?

Let's say I have to send some data to someone via email, like this data:

>>> data = b'\x04\x6d\x73\x67\x08\x08\x08\x20\x20\x20'
>>> print(data.decode())
   
>>> print(data)
b'\x04msg\x08\x08\x08   '
>>>

There are two problems I planted:

  1. If I tried to send that email in Unix, the email would send as soon as the \x04 character was read, because that is ASCII for END-OF-TRANSMISSION (Ctrl-D), so the remaining data would be left out of the transmission.
  2. Also, while Python is smart enough to escape all of my evil control characters when I print the data directly, when that string is decoded as ASCII, you can see that the 'msg' is not there. That is because I used three BACKSPACE characters and three SPACE characters to erase the 'msg'. Thus, even if I didn't have the \x04 character there, the end user wouldn't be able to translate from the text on screen to the real, raw data.

This is just a demo to show you how hard it can be to simply send raw data. Encoding the data into base64 format gives you the exact same data but in a format that ensures it is safe for sending over electronic media such as email.
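A short check of that claim, using the same bytes as above. The set of "safe" characters here is simply the base64 alphabet plus '=', written out for the example.

```python
import base64
import string

data = b'\x04\x6d\x73\x67\x08\x08\x08\x20\x20\x20'

encoded = base64.b64encode(data)

# Every byte of the encoded form is a letter, a digit, '+', '/' or '='.
safe = (string.ascii_letters + string.digits + '+/=').encode('ascii')
only_safe_characters = all(byte in safe for byte in encoded)

# Decoding gives back exactly the original bytes, control characters included.
round_tripped = base64.b64decode(encoded)
```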

Alan W. Smith
Greg Schmit
  • 16
    `base64.b64encode(s.encode()).decode()` is not very pythonic when all you want is a string to string conversion. `base64.encode(s)` should be enough at least in python3. Thanks for a very good explanation about strings and bytes in python – MortenB Feb 22 '18 at 09:53
  • 3
    @MortenB Yeah, it's weird, but on the upside is very clear what is happening as long as the engineer is aware of the difference between arrays of bytes and strings, since there is not a single mapping (encoding) between them, as other languages assume. – Greg Schmit Feb 22 '18 at 17:44
  • 5
    @MortenB By the way, `base64.encode(s)` wouldn't work in Python3; are you saying that something like that should be available? I think the reason it might be confusing is that, depending on the encoding and the content of the string, `s` might not have 1 unique representation as an array of bytes. – Greg Schmit Feb 22 '18 at 17:47
  • Schmitt: it was just an example of how simple it should be. the most common usecases should be like that. – MortenB Feb 23 '18 at 18:41
  • 2
    @MortenB but b64 is not just meant for text, any binary content can be b64 encoded (audio, images, etc). Making it work as you propose in my opinion hides the difference between text and byte array even more, making debugging harder. It simply moves the difficulty somewhere else. – Michael Ekoka Jun 02 '20 at 06:06
  • it uses `/` char so its a no go on linux :( – CpILL Jun 15 '20 at 09:35
  • @CplLL I did all of this on a Linux platform, so it should work fine; did you have a specific question or can you elaborate on your problem? – Greg Schmit Jun 15 '20 at 13:46
  • 1
    @MortenB `base64.encode` takes two file-like objects. "base64.encode(s) should be enough at least in python3" is incorrect – Mattwmaster58 Jul 23 '20 at 18:19
39

If the data to be encoded contains "exotic" characters, I think you have to encode in "UTF-8"

import base64

encoded = base64.b64encode(bytes('data to be encoded', "utf-8"))
Alecz
30

If the string is Unicode the easiest way is:

import base64

a = base64.b64encode(bytes(u'complex string: ñáéíóúÑ', "utf-8"))

# a: b'Y29tcGxleCBzdHJpbmc6IMOxw6HDqcOtw7PDusOR'

b = base64.b64decode(a).decode("utf-8", "ignore")

print(b)
# b :complex string: ñáéíóúÑ
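If you do this often, the two steps can be wrapped in a pair of small helpers. This is only a sketch: b64encode_text and b64decode_text are hypothetical names, and the UTF-8 default is an assumption that both sides of the exchange must agree on.

```python
import base64

def b64encode_text(text, encoding='utf-8'):
    """Hypothetical helper: str in, base64 str out."""
    return base64.b64encode(text.encode(encoding)).decode('ascii')

def b64decode_text(encoded, encoding='utf-8'):
    """Hypothetical helper: base64 str in, original str out."""
    return base64.b64decode(encoded).decode(encoding)

token = b64encode_text('complex string: ñáéíóúÑ')
restored = b64decode_text(token)
```

Keeping the encoding as an explicit parameter makes it visible that the text-to-bytes step is a separate choice from the base64 step.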
alfredocambera
  • Really not the easiest way, but one of the most clear ways, when it is important which encoding is used for transmitting the string, which is part of the "protocol" of the data transmission through base64. – xuiqzy Apr 10 '20 at 21:31
13

The error message tells you all you need:

expected bytes, not str

The leading b makes your string binary.

What version of Python do you use? 2.x or 3.x?

Edit: See http://docs.python.org/release/3.0.1/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit for the gory details of strings in Python 3.x

  • Thanks I am using, 3.x. Why does Python wants to convert it explictly to binary. The same in Ruby would be... requires > "base64" and then > Base64.encode64('data to be encoded') – dublintech Jan 18 '12 at 10:19
  • 2
    @dublintech Because (unicode) text is different from raw data. If you wanted to encode a text string in Base64, first you need to determine the character encoding (like UTF-8) and then you have bytes rather than characters, that you can encode in a text ascii-safe form. – fortran Jan 18 '12 at 10:44
  • 2
    This doesn't answer the question. He knows it works with a bytes object, but not a string object. The question is *why*. – Lennart Regebro Jan 18 '12 at 13:32
  • @fortran Default Python3 string encoding is UTF, do not know, why it has to be explicitly set. – xmedeko Jul 28 '16 at 12:03