I need to embed binary data into XML files, so I've chosen to use base85 encoding for this.
I have a large bytearray that's filled with the output of calls to struct.pack() via bytearray.extend(struct.pack(varying_data)). It then gets compressed with zlib and encoded with base64.b85encode().
This has always worked, but for one particular input file I get the following strange error:
    ValueError: base85 overflow in hunk starting at byte 582200
I then modified base64.py to print the value of the current chunk and the bytes it consists of. The input chunk is b'||a|3' and its value is 4,331,076,573, which is larger than 256^4 - 1 = 4,294,967,295, the largest value that fits into four bytes (that's where the error comes from).
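The arithmetic can be reproduced without patching base64.py. This is only a minimal sketch: the alphabet string is written out by hand and assumed to match the one used by CPython's base64 module, and the helper name chunk_value is made up for illustration.

```python
# The 85-character alphabet used by base64.b85encode()/b85decode()
# (assumption: copied by hand, should match CPython's base64 module).
B85_ALPHABET = (
    "0123456789"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "!#$%&()*+-;<=>?@^_`{|}~"
)


def chunk_value(chunk):
    """Interpret a 5-character group as a base-85 number (hypothetical helper)."""
    value = 0
    for char in chunk:
        value = value * 85 + B85_ALPHABET.index(char)
    return value


value = chunk_value("||a|3")
print(value)            # 4331076573
print(value >= 256**4)  # True: the group does not fit into four bytes
```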
But the thing I don't understand is: how can this happen?
This is the important part of the code:
    # Excerpt from the writer method; needs "import base64" and "import zlib".
    elif isinstance(self.content, (bytes, bytearray)):
        base85 = zlib.compress(self.content, 9)
        # pad=False doesn't make a difference here
        base85 = base64.b85encode(base85, pad=True).decode()
        base85 = escape_xml(base85)
        file.write(base85)

    def escape_xml(text):
        # "&" must be escaped first so the other entities are not escaped twice
        text = text.replace("&", "&amp;")
        text = text.replace("<", "&lt;")
        text = text.replace(">", "&gt;")
        text = text.replace("\"", "&quot;")
        text = text.replace("'", "&apos;")
        return text
And the code for decoding:
    import base64
    import zlib

    def decode_binary_data(data):
        data = unescape_xml(data)
        # Remove newline for mixed content support (does not apply in this case)
        data = data.split("\n", 1)[0]
        # Error!
        data = base64.b85decode(data)
        return zlib.decompress(data)

    def unescape_xml(text):
        # "&amp;" is resolved last, mirroring the order used in escape_xml()
        text = text.replace("&quot;", "\"")
        text = text.replace("&apos;", "'")
        text = text.replace("&lt;", "<")
        text = text.replace("&gt;", ">")
        text = text.replace("&amp;", "&")
        return text
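A minimal sketch of a round-trip check that might help narrow this down. The two helpers are rewritten compactly here so the snippet is self-contained, and the sample payload is made up. The last part is only an assumption on my side: it shows what would happen if the text were unescaped twice, for example if an XML parser had already resolved the entities before decode_binary_data() is called. I have not confirmed that this is what happens in my case.

```python
import base64
import zlib


def escape_xml(text):
    # "&" first, so the entities produced below are not escaped again
    for char, entity in (("&", "&amp;"), ("<", "&lt;"), (">", "&gt;"),
                         ('"', "&quot;"), ("'", "&apos;")):
        text = text.replace(char, entity)
    return text


def unescape_xml(text):
    # "&amp;" last, mirroring escape_xml()
    for entity, char in (("&quot;", '"'), ("&apos;", "'"), ("&lt;", "<"),
                         ("&gt;", ">"), ("&amp;", "&")):
        text = text.replace(entity, char)
    return text


payload = bytes(range(256)) * 64  # made-up sample data
encoded = base64.b85encode(zlib.compress(payload, 9)).decode()

# Escaping and unescaping exactly once restores the encoded text.
once = unescape_xml(escape_xml(encoded))
print(once == encoded)                                     # True
print(zlib.decompress(base64.b85decode(once)) == payload)  # True

# Unescaping twice is not harmless: "&", "l", "t" and ";" are all
# base85 characters, so encoded text may literally contain "&lt;".
tricky = "&lt;"
twice = unescape_xml(unescape_xml(escape_xml(tricky)))
print(twice)  # <
```

If characters are lost this way, every following five-character group shifts, which could turn valid groups into invalid ones.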
Base85 can theoretically represent 85^5 = 4,437,053,125 combinations per five-character group, but since the encoder only ever gets bytes as input, I wonder how a group outside the four-byte range can occur at all. Does this come from the compression? That shouldn't matter, as encoding and decoding should be symmetrical. If it is the problem, how can I compress the data anyway?
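To illustrate why I'm confused: as far as I can tell, the encoder itself can never produce such a group. A small sketch using only base64.b85encode() and base64.b85decode() from the standard library (the variable names are just for illustration):

```python
import base64

# The largest possible 4-byte input maps to the largest group the
# encoder can ever emit.
largest_group = base64.b85encode(b"\xff\xff\xff\xff")
print(largest_group)  # b'|NsC0'

# A group with a higher value, such as the one from the error message,
# is rejected by the decoder.
try:
    base64.b85decode(b"||a|3")
    decoder_accepts = True
except ValueError as exc:
    decoder_accepts = False
    print(exc)
```

If this is right, the text must have changed somewhere between b85encode() and b85decode().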
Choosing Ascii85 instead (a85encode()) works, but I don't think this really solves the problem. Might it fail in other cases?
Thank you for your help!