
I need to embed binary data into XML files, so I've chosen to use base85 encoding for this.

I have a large bytearray that's filled with the output of calls to struct.pack() via bytearray.extend(struct.pack(varying_data)). It then gets compressed with zlib and encoded with base64.b85encode().
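
For context, the pipeline roughly looks like this (the struct format string and values below are just placeholders to make it runnable):

import struct, zlib, base64

payload = bytearray()
for record in [(1, 2.5), (2, 7.25)]:
    # placeholder format string; the real calls pack varying data
    payload.extend(struct.pack("<if", *record))

compressed = zlib.compress(bytes(payload), 9)
encoded = base64.b85encode(compressed, pad=True).decode()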

This has always worked, but on a single input file I get the following strange error:

ValueError: base85 overflow in hunk starting at byte 582200

I then modified base64.py to print out the value of the offending chunk and the bytes it consists of. The input chunk is b'||a|3' and its value is 4,331,076,573, which is bigger than 256^4 = 4,294,967,296 and therefore can't be represented by four bytes (that's where the error comes from).
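
For reference, that value can be reproduced by hand. A small sketch, hard-coding the RFC 1924 alphabet that base64.b85decode uses:

# base85 alphabet in the order used by base64.b85encode/b85decode (RFC 1924)
B85_ALPHABET = (
    "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz!#$%&()*+-;<=>?@^_`{|}~"
)

value = 0
for char in "||a|3":
    value = value * 85 + B85_ALPHABET.index(char)

print(value)          # 4331076573
print(value > 256**4) # True -- too big for four bytes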

But the thing I don't understand is: how can this happen?

This is the important part of the code:

elif isinstance(self.content, (bytes, bytearray)):
    base85 = zlib.compress(self.content, 9)

    # pad=False doesn't make a difference here
    base85 = base64.b85encode(base85, pad=True).decode()

    base85 = escape_xml(base85)

    file.write(base85)

def escape_xml(text):
    # "&" has to be replaced first so the other entities aren't escaped twice
    text = text.replace("&", "&amp;")
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    text = text.replace("\"", "&quot;")
    text = text.replace("'", "&apos;")

    return text

And the code for decoding:

def decode_binary_data(data):
    data = unescape_xml(data)

    # Remove newline for mixed content support (does not apply in this case)
    data = data.split("\n", 1)[0]

    # Error!
    data = base64.b85decode(data)

    return zlib.decompress(data)

def unescape_xml(text):
    text = text.replace("&quot;", "\"")
    text = text.replace("&apos;", "'")
    text = text.replace("&lt;", "<")
    text = text.replace("&gt;", ">")
    text = text.replace("&amp;", "&")

    return text

Base85 can theoretically encode 85^5 = 4,437,053,125 values per group, but since the encoder only gets input from bytes, I'm wondering how this is even possible. Does it come from the compression? That shouldn't be the problem, as encoding and decoding should be symmetrical. And if it is the problem, how can I compress the data anyway?
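
For what it's worth, a quick sanity check on arbitrary random data suggests compression and base85 are symmetric on their own:

import os, zlib, base64

blob = os.urandom(100_000)  # arbitrary test data
encoded = base64.b85encode(zlib.compress(blob, 9))
assert zlib.decompress(base64.b85decode(encoded)) == blob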

Choosing Ascii85 instead (a85encode()) works, but I don't think that really solves the problem; maybe it just fails in other cases?

Thank you for your help!

timodriaan

2 Answers


I found the problem! Neither the base85 algorithm nor the compression is the issue here. It is the XML.

For exporting/writing the XML with the embedded base85 string, I wrote my own class and functions so that the output looks pretty (xml.etree.ElementTree writes everything onto one line, and for this project I can't use external packages from pip). That's why the base85 string has to be escaped manually.

But for reading the XML files, I use xml.etree.ElementTree. I didn't know that most XML libraries (un)escape strings automatically (which makes sense).

So the problem was the manual unescaping, which ElementTree already does automatically. As a result, the base85 string got unescaped twice. Since the base85 alphabet contains every character that appears in the XML escape sequences (&amp;, &lt;, etc.), and the base85 string here is over 500,000 characters long, it is likely that somewhere in it a run of characters happens to form a valid XML escape sequence.

And that was exactly the issue: the base85 string contained &lt;, which got unescaped a second time, shifting all the following characters and causing the overflow.
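
A minimal illustration with a hypothetical payload, using the escape_xml/unescape_xml functions from the question:

# Hypothetical payload: "&lt;" is a perfectly legal run of base85 characters.
raw_b85 = "abc&lt;def"

escaped = escape_xml(raw_b85)        # written to the file as "abc&amp;lt;def"

from_etree = raw_b85                 # ElementTree already unescaped it on read

corrupted = unescape_xml(from_etree) # "abc<def" -- three characters shorter,
                                     # every following base85 group is shifted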

timodriaan

I work with LabView, Python and JavaScript a lot, and had to create my own Base85 encode and decode routines for LabView, which only has MD5 checksums built in. For encryption or good obfuscation you have to roll your own. Maybe future versions of LabView will have Base85 in a library.

The point I am making is that I now have all three flavors of base85: Ascii85, Base85 and Z85. Each one uses a unique character set when converting from base 10 to base 85. Each version can be tripped up (producing corrupted output) by things like control characters, too many space characters in a row, symbol-heavy content such as HTML and XML, or characters above 126 (tilde).

To safely encode large text files, especially multi-line and symbol-heavy material, I simply have the code detect all of these potential problems and convert to hexadecimal first. Yes, it doubles the character count, but the base-10-to-base-85 engine will not crash. Even for large plain-text files, Z85 would crash after a thousand characters or so; the problem was the Z85 character map, which has the symbols out of decimal order, so an overflow would occur on long strings. For my own purposes I changed the Z85 character map so the symbols are in decimal order, and now Z85 no longer crashes on large files.
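
My routines are in LabView, but the hex-first idea looks roughly like this in Python (function names are my own):

import base64
import binascii

def encode_via_hex(data: bytes) -> str:
    # Hex first (doubles the character count), then base85 over plain ASCII hex digits.
    return base64.b85encode(binascii.hexlify(data)).decode("ascii")

def decode_via_hex(text: str) -> bytes:
    return binascii.unhexlify(base64.b85decode(text))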

Ascii85, Base85 and Z85 are all subject to crashing due to the issues mentioned above, whether written in Python, JavaScript or LabView. Often it is multiple consecutive symbols or spaces that cause a math overflow, so the output is corrupted and cannot be decoded.

NOTE: It is very important to pad your strings so their length is divisible by 4, and when decoding, to pad your encoded string with 'u' or a tilde so its length is divisible by 5.