30

I have a string that I'm encoding into base64 to conserve space. Is it a big deal if I remove the equal sign at the end? Would this significantly decrease entropy? What can I do to ensure the length of the resulting string is fixed?

>>> base64.b64encode(combined.digest(), altchars="AB")
'PeFC3irNFx8fuzwjAzAfEAup9cz6xujsf2gAIH2GdUM='

Thanks.

ensnare
  • 15
    *I have a string that I'm encoding into base64 to conserve space* - Base64 doesn't conserve space, it does the opposite. It is typically used to express arbitrary byte sequences in (usually ascii-based) line protocols. – MattH Jan 26 '12 at 15:45
  • 1
    Is it just me that's surprised to read 'string that I'm encoding into base64 to conserve space'? Base64 is more verbose than your average string and its more common use is to transfer BINARY data as a string. – jv42 Jan 26 '12 at 15:47
  • 2
    And also, please don't think that Base64 is encryption, as many people seem to do. – jv42 Jan 26 '12 at 15:48
  • @MattH it conserves space when compared to, say, hex. – jterrace Jan 26 '12 at 16:51
  • 3
    You shouldn't use `AB` for the altchars... base64 uses `A-Za-z0-9` to represent the 6-bit values 0-61, altchars selects what's used for the values of 62 and 63. Using something that's already assigned to a value will cause decoding errors... e.g. ``b64decode(b64encode('\x00','AB'),'AB')`` will return `'\xfb'` instead of `'\x00'`. Even if you're just hashing, that *is* discarding entropy, though removing padding isn't. – Eli Collins Jan 27 '12 at 00:11
  • Using Base64 to conserve space is usually implicitly compared to hex encoding, so I get where he is coming from – idanzalz May 13 '13 at 12:40
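The `altchars` point raised in the comments above can be checked directly. This is only a small sketch; the `b"-_"` pair is an assumption chosen for illustration (it is the same pair the standard library's URL-safe variant uses), not something the question asked for.

```python
import base64

raw = b"\x00"

# 'A' and 'B' already stand for the 6-bit values 0 and 1, so reusing them
# for the values 62 and 63 makes decoding ambiguous.
bad = base64.b64decode(base64.b64encode(raw, altchars=b"AB"), altchars=b"AB")

# '-' and '_' are not part of the standard alphabet, so they round-trip.
good = base64.b64decode(base64.b64encode(raw, altchars=b"-_"), altchars=b"-_")

print(bad == raw, good == raw)
```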

7 Answers

20

Base64 converts every 3 bytes of input into 4 ASCII characters, and the '=' character pads the result so that the encoded length is always a multiple of 4. If the input is an exact multiple of 3 bytes, you get no equals sign. One spare byte means you get two '=' characters at the end; two spare bytes means you get one. Depending on how you decode the string, the decoder may or may not accept it as valid once the padding is removed. With the example string you have, it doesn't decode, but some simple strings I've tried do decode.
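The padding rule described above can be sketched in a few lines of Python. The helper name `padding_count` is made up for this illustration; it simply compares the rule against what the standard `base64` module produces.

```python
import base64

def padding_count(n_bytes):
    """Number of '=' characters standard base64 appends for n_bytes of input."""
    return (3 - n_bytes % 3) % 3

# Compare the rule with the actual encoder output for a few input lengths.
for n in range(1, 7):
    encoded = base64.b64encode(b"x" * n)
    print(n, encoded, encoded.count(b"="), padding_count(n))
```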

You can read this page for a better understanding of base64 strings and encoding/decoding.

http://www.nczonline.net/blog/2009/12/08/computer-science-in-javascript-base64-encoding/

There are free online encoder/decoders that you can use to check your output string

Brian
18

Looking at your code:

>>> base64.b64encode(combined.digest(), altchars="AB")
'PeFC3irNFx8fuzwjAzAfEAup9cz6xujsf2gAIH2GdUM='

The string that's being encoded in base64 is the result of a function called digest(). If your digest function is producing fixed length values (e.g. if it's calculating MD5 or SHA1 digests), then the parameter to b64encode will always be the same length.

If the above is true, then you can strip off the trailing equals signs, because there will always be the same number of them. If you do that, simply append the same number of equals signs to the string before you decode.

If the digest is not a fixed length, then it's not safe to trim the equals signs.

Edit: Looks like you might be using a SHA-256 digest? The SHA-256 digest is 256 bits (or 32 bytes). 32 bytes is 10 groups of 3, plus two left over. As you'll see from the Wikipedia section on padding, that means you always have one trailing equals sign. If it is SHA-256, then it'd be OK to strip it, so long as you remember to add it again before decoding.
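A minimal sketch of that fixed-length case, assuming the digest really is SHA-256 (the question doesn't say so explicitly, and the input bytes here are made up):

```python
import base64
import hashlib

digest = hashlib.sha256(b"example input").digest()  # always 32 bytes
encoded = base64.b64encode(digest)                   # always 44 characters, one trailing '='
stripped = encoded.rstrip(b"=")                      # always 43 characters

# Re-append the single '=' before decoding.
restored = base64.b64decode(stripped + b"=")
print(len(encoded), len(stripped), restored == digest)
```

Because the digest length never changes, the stripped string has a fixed length too.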

tew
Martin Ellis
  • You're right. The encoded digest in the original question has 44 bytes. So we have 256 bits, which is (10 groups of 3 bytes) + (2 bytes leftover), being encoded into 11 groups of 4 bytes. I've updated my comment. Thanks. – Martin Ellis Jan 26 '12 at 17:17
  • It seems that PHP and JavaScript's in-built decode functions (`base64_decode` and `atob`) don't care about the padding. Somebody [here](https://stackoverflow.com/a/56240229) says "The only reason to have it on in that case might be to add tolerance to decoders that don't work without the padding. If you control both ends, that's a non-concern." Thoughts? – joe Feb 01 '20 at 12:15
16

It's fine to remove the equals signs, as long as you know what they do.

Base64 outputs 4 characters for every 3 bytes it encodes (in other words, each character encodes 6 bits). The padding characters are added so that the length of any base64 string is always a multiple of 4; the padding chars don't actually encode any data. (I can't say for sure why this was done: as a way of error checking if a string was truncated, to ease decoding, or something else?)

In any case, that means if you have x base64 characters (sans padding), there will be (-x)%4 padding characters, i.e. 0 when x%4=0, 2 when x%4=2, and 1 when x%4=3. (x%4=1 will never happen, due to the factorization of 6 and 8.) Since these contain no actual data, and can be recovered, I frequently strip these off when I want to save space, e.g. the following:

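from base64 import b64encode, b64decode

# encode data and strip the padding (b64encode returns bytes in Python 3)
raw = b'\x00\x01'
enc = b64encode(raw).rstrip(b"=")

# func to restore padding: -len(data) % 4 is 0, 1, or 2 for valid input
def repad(data):
    return data + b"=" * (-len(data) % 4)

# decode after re-adding the padding
raw = b64decode(repad(enc))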
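
As a quick check of the claim above that x%4 is never 1, here is a small sketch that strips and restores padding for every input length from 0 to 63 and records the remainders it sees. The helper names `strip_padding` and `restore_padding` are made up for this illustration.

```python
from base64 import b64encode, b64decode

def strip_padding(encoded: bytes) -> bytes:
    """Drop the trailing '=' characters."""
    return encoded.rstrip(b"=")

def restore_padding(stripped: bytes) -> bytes:
    """Re-append as many '=' as needed to reach a multiple of 4."""
    return stripped + b"=" * (-len(stripped) % 4)

remainders = set()
for n in range(64):
    raw = bytes(range(n))
    stripped = strip_padding(b64encode(raw))
    remainders.add(len(stripped) % 4)
    assert b64decode(restore_padding(stripped)) == raw

print(sorted(remainders))  # [0, 2, 3]
```

A remainder of 2 needs two '=' characters and a remainder of 3 needs one.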
Eli Collins
  • Somebody with more knowledge, please correct my C# version if it is incorrect: var pad = (text.Length % 4); if (pad == 3) pad = 1; for (int i = 0; i < pad; i++) text += "="; – nikib3ro Jun 15 '12 at 06:04
  • According to what @Eli Collins says here, the C# equivalent for what you (@nikib3ro) wrote is `var pad = text.Length % 4`. You won't need the `if` block because there will be no case where the result of mod becomes `1`. – Reza Dec 01 '21 at 11:42
  • I found that you **do** need the `if (pad == 3) pad = 1;` statement. – AGB May 27 '22 at 13:06
1

Other than in the case @Martin Ellis points out, messing with the padding characters can lead to getting a

TypeError: Incorrect padding

(a binascii.Error in Python 3), and producing some garbage while you're at it.

As stated by @MattH, base64 will do the opposite of conserving space.

Instead, to conserve space, you should apply a compression algorithm such as zlib.

For example:

import zlib

s = b'''large string....'''  # zlib.compress needs bytes in Python 3
compressed = zlib.compress(s)

compression_ratio = len(s) / len(compressed)

# And later...
out = zlib.decompress(compressed)

# The above function is also good for relieving stress.
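
For a sense of the sizes involved, here is a rough sketch comparing raw bytes, hex, and base64 for a 32-byte digest, plus zlib on a repetitive string. The inputs are made up for illustration, and the compression result depends heavily on the data: highly repetitive text shrinks a lot, while hash output generally won't.

```python
import base64
import binascii
import hashlib
import zlib

digest = hashlib.sha256(b"example input").digest()
print(len(digest))                     # raw bytes
print(len(binascii.hexlify(digest)))   # hex text
print(len(base64.b64encode(digest)))   # base64 text

# Repetitive input, chosen so that compression has something to work with.
text = b"large repetitive string " * 100
compressed = zlib.compress(text)
print(len(text), len(compressed))
```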
HeyWatchThis
1

Those are padding, and you don't save much by removing them since there are at most two of them, so if you want to save space, look elsewhere. As for the reference to entropy: are you compressing these base64 strings? If so, even if you do remove them, it will not have much of an effect on the compressed size.

Dan D.
0

Unless you are concatenating multiple Base64-encoded files or strings, it's safe to remove the equals signs, as they're only used for padding purposes.
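
A small sketch of why concatenation is the exception: once the padding is stripped, the boundary between two encoded values is lost and the joined string decodes as a single, different value. The one-byte inputs are made up for illustration.

```python
import base64

a = base64.b64encode(b"a")   # b'YQ=='
b = base64.b64encode(b"b")   # b'Yg=='

# Strip the padding from each piece, then join them.
joined = a.rstrip(b"=") + b.rstrip(b"=")   # b'YQYg'

# The result is a valid 4-character group, but it no longer decodes to b'ab'.
decoded = base64.b64decode(joined)
print(decoded, decoded == b"ab")
```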

Kevin Guto
0

I don't think so.
https://en.wikipedia.org/wiki/Base64#Output_padding

These equals are "useful".

Pang