10

I want to encode and decode binary data within an XML file (with Python, but whatever). I have to face the fact that XML element content disallows certain characters. The only allowed ones are described in the XML spec:

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

Which means that the unallowed are:

  • 29 of the 32 Unicode control characters (0x00-0x1F, i.e. 000xxxxx) are illegal; only 0x09, 0x0A and 0x0D are allowed
  • The code points reserved as UTF-16 surrogates (U+D800-U+DFFF, i.e. 11011xxx) are illegal
  • The special Unicode noncharacters (U+FFFE and U+FFFF, i.e. 11111111 1111111x) are illegal
  • <, > and & are disallowed in element/entity content, according to this post

1 byte can encode 256 possible values. With these restrictions the first byte is limited to 256 - 29 - 8 - 1 - 3 = 215 possibilities.

Of that first byte's 215 possibilities, base64 uses only 64. Base64 generates 33% overhead (6 bits of data become 1 byte once encoded with base64).
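For reference, the 33% figure is easy to verify with Python's standard base64 module (just an illustration):

    import base64, os

    data = os.urandom(3000)            # arbitrary binary payload
    encoded = base64.b64encode(data)   # every 3 input bytes become 4 output bytes
    print(len(encoded) / len(data))    # 1.333..., i.e. ~33% overhead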

So my question is simple: Is there an algorithm more efficient than base64 to encode binary data within XML? If not, where should we start to create it? (libraries, etc.)

NB: Please don't answer this post with "You shouldn't use XML to encode binary data because...". Just don't. You could at best argue for not using all 215 possibilities because of poor XML parser support.

NB2: I'm not speaking about the second byte, but there are certainly some considerations we can develop regarding the number of possibilities and the fact that it should start with 10xxxxxx to respect the UTF-8 standard when we use the supplementary Unicode planes (and what if we don't?).

KrisWebDev
  • Well, you could use an approach similar to [yEnc](http://en.wikipedia.org/wiki/YEnc) whereby you only escape characters which have another meaning within the medium in which they're used, but I don't know of any existing implementations of such an approach for your particular case. See also [this page](http://en.wikipedia.org/wiki/Binary-to-text_encoding#Encoding_standards). – Aya Jun 25 '13 at 15:58
  • The way you frame the question is a little unclear because you confuse the concept of Unicode Scalar Value (an integer) and the Code Units of a Unicode Transformation Format (especially UTF-8). When you say "the unallowed are", I think you mean "the unallowed first code units in UTF-8 are", but you also refer to Characters, which should be described as scalar values. – Jim DeLaHunt Mar 11 '23 at 23:05

5 Answers

8

Thank you Aya for the Ascii85 link; there are very good ideas there.

I developed them below for our case.


UTF-8 character possibilities:


For 1-byte characters (0xxxxxxx): 96 possibilities per byte

  • + UTF-8 ASCII chars 0xxxxxxx = +2^7
  • - UTF-8 Control chars 000xxxxx = -2^5
  • + XML allowed UTF-8 control chars (0x09, 0x0A, 0x0D) = +3
  • - XML entity unallowed chars (<, >, &) = -3

EDIT: This is for the XML 1.0 spec. The XML 1.1 spec allows the use of control chars except 0x00...

For 2-byte characters (110xxxxx 10xxxxxx): 1920 possibilities per 2 bytes

  • + UTF-8 2-byte chars 110xxxxx 10xxxxxx = +2^11
  • - UTF-8 illegal non-canonical chars (1100000x 10xxxxxx) = -2^7

For 3-byte characters (1110xxxx 10xxxxxx 10xxxxxx): 61440 possibilities per 3 bytes

  • + UTF-8 3-byte chars 1110xxxx 10xxxxxx 10xxxxxx = +2^16
  • - UTF-8 illegal non-canonical chars (11100000 100xxxxx 10xxxxxx) = -2^11
  • - Unicode reserved UTF-16 codepoints (11101101 101xxxxx 10xxxxxx) = -2^11

And I won't make the calculation for 4-byte characters, that's pointless: the extra possibilities would be negligible and there are too many illegal UTF-8 sequences in this range.
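As a sanity check, the counts above can be verified with a quick script that classifies the XML 1.0-legal code points by the length of their UTF-8 encoding (note it also excludes U+FFFE and U+FFFF, which the Char production forbids, so it reports 61438 rather than the round 61440 used above):

    from collections import Counter

    def xml10_legal(cp):
        # Char production of XML 1.0, minus <, > and & for element content
        if cp in (0x3C, 0x3E, 0x26):
            return False
        return (cp in (0x09, 0x0A, 0x0D)
                or 0x20 <= cp <= 0xD7FF
                or 0xE000 <= cp <= 0xFFFD
                or 0x10000 <= cp <= 0x10FFFF)

    sizes = Counter(len(chr(cp).encode('utf-8'))
                    for cp in range(0x110000) if xml10_legal(cp))
    print(sizes)   # 96 one-byte, 1920 two-byte, 61438 three-byte, 1048576 four-byte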


The coding possibilities in, let's say, a 3-byte space


So let's see what combinations we can do on a 3-byte (24bit) space:

  • 0xxxxxxx 0xxxxxxx 0xxxxxxx : That's 96*96*96 = 884736 possibilities
  • 0xxxxxxx 110xxxxx 10xxxxxx : That's 96*1920 = 184320 possibilities
  • 110xxxxx 10xxxxxx 0xxxxxxx : That's 1920*96 = 184320 possibilities
  • 1110xxxx 10xxxxxx 10xxxxxx : That's 61440 possibilities

There would be other possibilities (like a 3-byte char ending or starting in the space), but like the 4-byte chars they would be difficult to evaluate (for me) and probably negligible.

Total number of possibilities:

  • A 3-byte space has 2^24 = 16777216 possibilities.
  • The number of UTF-8 COMPATIBLE possibilities in that space is 884736 + 2*184320 + 61440 = 1314816.

How much overhead does that mean?

  • 24-bit space usable bits: Log2(16777216) = 24 (of course! that's just to check the math)
  • This space's UTF-8 useful bits: Log2(1314816) = 20.32 useful bits.
  • That means we need 24 bits of space to encode 20.32 bits of useful information, i.e. the minimum theoretical overhead is 18%. Way better than Base64's 33% overhead and Ascii85's 25% overhead!

EDIT: This is for the XML 1.0 spec. With XML 1.1 (not widely supported...), the theoretical overhead is 12.55%. I managed to make a binary-safe algorithm with 14.7% overhead for XML 1.1.
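The 18% figure is just the ratio of space bits to useful bits; a two-line check of the XML 1.0 number:

    import math

    useful = math.log2(1314816)   # ~20.33 useful bits per 24-bit space
    print(24 / useful - 1)        # ~0.18, the minimum theoretical overhead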


How to get near this 18% overhead?


The bad news is that we can't easily get that 18% overhead without using a big "dictionary" (i.e. long encoding sets). But it's easy to get 20%, and quite easy but less practical to get 19%.

Good coding lengths candidates:

  • 6 bits can encode 5 bits with 20% overhead (2^(6*0.84) > 2^5)
  • 12 bits can encode 10 bits with 20% overhead (2^(12*0.84) > 2^10)
  • 24 bits can encode 20 bits with 20% overhead (2^(24*0.84) > 2^20)
  • 25 bits can encode 21 bits with 19% overhead (2^(25*0.84) > 2^21)

NB: 0.84 is the average "usefulness" of a space bit (20.32/24).


How to build our encoding algorithm?


We need to build a "dictionary" that will map the "space possibilities" (random sequences of bits whose length is 5, 10, 20 or 21 bits, depending on the coding length chosen for the algorithm - just choose one) onto UTF-8-compatible sequences (UTF-8 bit sequences whose length is 6, 12, 24 or 25 bits accordingly).

The easiest starting point would be the encoding of 20-bit sequences into 24-bit UTF-8-compatible sequences: that's exactly the example taken above to calculate the possibilities, and it's 3 UTF-8 bytes long (so we won't have to worry about unterminated UTF-8 chars).

Note that we MUST USE the 2-byte (or longer) UTF-8 character encoding space to reach the 20% overhead. With only 1-byte UTF-8 characters we can only reach 25% overhead with RADIX-24. However, the 3-byte UTF-8 characters are not needed to reach the 20% overhead.

That's the next challenge for this question. Who wants to play? :)


A proposed algorithm, which I'll name BaseUTF-8 for XML


20 binary bits to encode: ABCDEFGHIJKLMNOPQRST

Resulting UTF-8 string named "encoded": 24 bits long

Mathematical encoding algorithm (not based on any known programming language):

If GH != 00 && NO != 00:
    encoded = 01ABCDEF 0GHIJKLM 0NOPQRST # 20 bits to encode, 21 space bits with restrictions (1-byte UTF-8 char not starting by 000xxxxx ie ASCII control chars)

If ABCD != 0000:
    If GH == 00 && NO == 00: # 16 bits to encode
        encoded = 0010ABCD 01EFIJKL 01MPQRST    
    Else If GH == 00:  # 18 bits to encode, 18 space bits with restrictions (1-byte  UTF-8 ASCII control char, 2-bytes UTF-8 char noncanonical)
        encoded = 0NOFIJKL 110ABCDE 10MPQRST
    Else If NO == 00:  # 18 bits to encode
        encoded = 110ABCDE 10MPQRST 0GHFIJKL

If ABCD == 0000: # 16 bits to encode
    encoded = 0011EFGH 01IJKLMN 01OPQRST

On "encoded" variable apply:
    convert < (0x3C) to Line Feed (0x0A)
    convert > (0x3E) to Carriage Return (0x0D)
    convert & (0x26) to TAB (0x09)

And that's how you get only 20% overhead.

This algorithm doesn't yet provide a way to manage string termination, for when the length of the string to encode is not a multiple of 20 bits. The decoding algorithm also has to be provided, but that's quite easy (just don't forget to throw exceptions to enforce uniqueness of decoding).
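For illustration, a minimal Python sketch of the first branch only (the GH != 00 && NO != 00 case) together with the final substitution step; the other branches, the decoder and the termination handling are left out, and the function name is arbitrary:

    # Sketch of the first encoding branch only (GH != 00 and NO != 00).
    SUBST = {0x3C: 0x0A, 0x3E: 0x0D, 0x26: 0x09}   # < -> LF, > -> CR, & -> TAB

    def encode_block_branch1(v):
        """Encode a 20-bit integer v as the 3 bytes 01ABCDEF 0GHIJKLM 0NOPQRST."""
        assert 0 <= v < 1 << 20
        assert (v >> 12) & 0b11 != 0 and (v >> 5) & 0b11 != 0   # GH != 00, NO != 00
        raw = bytes([0b01000000 | (v >> 14),   # 01ABCDEF
                     (v >> 7) & 0x7F,          # 0GHIJKLM
                     v & 0x7F])                # 0NOPQRST
        return bytes(SUBST.get(b, b) for b in raw)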

KrisWebDev
  • That's very clever. Two comments: (1) Your use of line feed, carriage return, and tab is fragile. These characters are all too likely to be translated/munged in XML documents in transit: carriage returns removed when converting line endings, tabs converted to spaces, reformatting for XML pretty-printing, etc... I think you really have to avoid them. – Celada Jun 28 '13 at 13:53
  • (2) I don't understand the comment "And I won't make the calculation for 4-byte characters, that's pointless: the extra possibilities would be negligible and there are too many illegal UTF-8 sequences in this range.": outside the BMP, there are *NO* forbidden Unicode characters, no? Anyway, doesn't matter: your final proposed encoding scheme doesn't even use Unicode characters beyond U+7FF, let alone characters outside the BMP. – Celada Jun 28 '13 at 13:55
  • @Celada: (2) 4-bytes long UTF-8 characters (ie starting with 11110) have at least the following restrictions: non-canonical sequences (ie F0 must be followed by 90->BF, F4 must be followed by 80->8F), illegal unicode codepoints above U+10FFFF (ie illegal F4 90->FF FF). – KrisWebDev Jun 28 '13 at 14:35
  • That's not too onerous: it's roughly the same number of restrictions as you have with 3-byte UTF-8, where 0xe0 must be followed by 0xa0~0xbf and 0xed cannot be followed by either 0xa0 or 0xa1. Actually I noticed just now that U+xFFFE and U+xFFFF are guaranteed non-characters, where 0 <= x <= 0x10 (i.e. the last two code points in every plane including the BMP). I guess that means they would have to be avoided. I knew U+FFFE and U+FFFF (in the BMP) were invalid, but I didn't know about the other planes. Anyway, once again, moot because your scheme uses no Unicode character greater than U+7FF. – Celada Jun 28 '13 at 15:14
  • You could do this more simply. Define a range R1 of 90 characters from U+0021 to U+007D, excluding `<`, `&`, and `>`, as integers 0..89. Define a range R2 of 1920 characters from U+0080 to U+07FF, as integers 0..1919. Convert 20 binary bits to a 20-bit integer, INT. If 0<=INT< 90^3, emit 3 R1 chars where C1a*C1b*C1c = INT. If 90^3 <= INT < 90^3+1920, emit 1 R1 char and 1 R2 char where C1a*C2a = (INT-90^3). Else emit 1 R2 char and 1 R1 char where C2a*C1 = (INT-(90^3*1920)). And you have U+007E, U+007F for special use, and no XML syntax conflicts. Proof: 90^3 + 90*1920 + 1920*90 > 2^20. – Jim DeLaHunt Mar 12 '23 at 01:56
  • A strength of this scheme is that it does achieve a low 1.2 encoded bits per data bit — in UTF-8. A weakness is that it is very closely tied to the details of UTF-8 code unit structure. If this encoded data were to be converted to UTF-16, it would take 16-48 encoded bits to represent 20 data bits, or 0.8-2.4 encoded bits per data bit. If the encoded data were to be converted to UTF-32, it would take 32-96 encoded bits per 20 data bits, or 1.6-4.8 encoded bits per data bit. Base64 is a predictable 1.33, 2.67, or 5.33 encoded bits per data bit in UTF-8, UTF-16, and UTF-32 respectively. – Jim DeLaHunt Mar 12 '23 at 02:06
2

I have developed the concept in C code.

The project is on GitHub and is finally called BaseXML: https://github.com/kriswebdev/BaseXML

It has a 20% overhead, which is good for a binary-safe version.

I had a hard time making it work with Expat, which is the behind-the-scenes XML parser of Python (AND IT DOESN'T SUPPORT XML 1.1!). So you'll find the binary-safe BaseXML version for XML 1.0.

I may release the "for XML 1.1" version later if requested (it is also binary-safe and has a 14.7% overhead). It's ready and working, but it's useless with Python's built-in XML parsers, so I don't want to confuse people with too many versions (yet).

KrisWebDev
1

It's worse than that: you don't actually have 215 different byte values you can use. The resulting binary data have to be valid in whatever encoding the XML is represented in (which is almost certainly UTF-8), which means that many, many byte sequences are forbidden. 0xc2 followed by 0x41 would be just one random example. XML is text (a sequence of Unicode characters), not binary data. When transmitted, it is encoded using some encoding (which is almost always UTF-8). If you try to treat it as binary data, then you are, in my opinion, asking for way more trouble than it's worth dealing with.

If you still want to do this...

XML is text. So let's not try to encode your binary data as binary data. That will not lead to an easy or obvious way to shoehorn it into an XML document. Let's try instead encoding your binary data as text!

Let's try one very simple encoding:

  • Group your binary data into blocks of 20 bits
  • Encode each group of 20 bits as the Unicode character U+10000 plus the numeric value of the 20 bits.

This will mean you exclusively use characters from planes 1 through 16. All of the restricted characters are in plane 0 (the BMP), so you are safe here.
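A rough sketch of this toy scheme in Python, just to make the bookkeeping concrete (the zero-padding to a multiple of 20 bits is glossed over, so it is not reversible without also storing the original length):

    def encode_planes(data: bytes) -> str:
        """Map each 20-bit group onto one character in planes 1-16."""
        bits = int.from_bytes(data, 'big')
        nbits = len(data) * 8
        pad = (-nbits) % 20        # pad with zero bits up to a multiple of 20
        bits <<= pad
        nbits += pad
        return ''.join(chr(0x10000 + ((bits >> shift) & 0xFFFFF))
                       for shift in range(nbits - 20, -1, -20))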

When you then encode this XML document as UTF-8 for transmission, each of these characters will require 4 bytes to encode. So you consume 32 bits for every 20 bits of original data, which is 60% overhead compared to pure binary encoding of the original data. This is worse than base64's 33%, which makes it a terrible idea.

This encoding scheme is slightly wasteful because it makes no use of BMP characters. Can we use BMP characters to make it better? Not trivially. 20 is the largest size we can use for the groups (log₂(0x10FFFF) ≈ 20.09). We could remap our scheme to use as many BMP characters as possible, because these take less space to encode with UTF-8, but not only would this complicate the encoding a lot (the forbidden characters are scattered, so we have several cases to handle), it can only lead to an improvement for about 6.25% of bit patterns (the fraction of Unicode characters that are in the BMP), and for the majority of that 6.25% we'd save only one byte. For random data, the overhead decreases from 60% to around 55%. The result would still be much worse than base64 except for some very contrived data. Note that the overhead is data-dependent though. For 0.2% of bit patterns you will actually get compression instead of overhead (60% compression for 0.012% of patterns and 20% compression for 0.18% of patterns). But these fractions are really low. It's just not worth it.

To put this another way: if you want to encode anything using 4-byte UTF-8 sequences, you need to use 32 bits per sequence (of course), but 11 of those bits are fixed and unchangeable: the bits must fit the pattern 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (and there are only 21 xs in there). That overhead of 60% is built into UTF-8, so if you want to use this as the basis of any encoding that improves upon the overhead of base64, you are starting from behind!

I hope this convinces you that you can't improve on the density of base64 using any scheme of this type.

Celada
  • Surely the existence of [Ascii85](http://en.wikipedia.org/wiki/Ascii85) proves this wrong? – Aya Jun 25 '13 at 19:25
  • @Aya: let me clarify. Sure, base85 is more compact than base64. My point is that if you try to use beyond the first 128 Unicode characters, you lose because their UTF-8 representations are not compact: 2-byte UTF-8 has a built-in ~45% overhead, 3-byte UTF-8 has a built-in ~50% overhead, and 4-byte UTF-8 has a built-in ~60% overhead. If you stick to 1-byte UTF-8 then you're in business. – Celada Jun 25 '13 at 19:39
  • I realize that. It just seems odd to propose a solution which makes no use of the most compact character representations in UTF-8, i.e. those which can be expressed in a single byte. – Aya Jun 25 '13 at 19:50
  • I don't propose a solution. Rather my point is to dissuade the OP from the line of thought that it would be advantageous to make use of characters than take >1 byte in UTF-8. I feel I've made this clear in my concluding sentence. Occasionally, IMHO, the best answer to a SO question is a "you shouldn't do that" answer. – Celada Jun 25 '13 at 19:58
  • In retrospect, I think it might be possible to make use of characters beyond those which can be represented in a single byte. Just because a two-byte sequence has a 45% overhead doesn't mean you can't differentiate between two-byte and one-byte sequences, so it ought to be possible to construct a lookup table which makes effective use of two-byte sequences. – Aya Jun 25 '13 at 20:06
  • Not sure about that... There are 1920 characters which encode to 2 bytes. No matter how you slice it, encoding `log₂(1920)=10.9` bits of information using 16 bits means about 47% overhead. You can add the 128 characters that encode to 1 byte into the mix to improve the average compactness, but since there are only 128 of those characters it doesn't make a huge difference. – Celada Jun 25 '13 at 20:32
  • I'm not 100% on that point. I was thinking of writing a quick prototype to see if it was possible, but I don't have time right now. – Aya Jun 25 '13 at 20:36
  • Thanks Aya and Celada. Just beware that using e.g. the 4-byte-long character encoding doesn't mean there is more overhead (47%). The presence of the UTF-8 4-byte "header" is in itself a valuable piece of information. E.g. we could decode **SOME** 16-bit strings to a longer 180567-bit-long string if we wanted to: the overhead would be negative. It's just that we can't encode every 180567-bit sequence into a 16-bit one, but we can encode some. And that's the point: the more you can map SOME string sequences in the different encoding ways (1, 2, 3, 4-byte long), the more efficient the algorithm will be. – KrisWebDev Jun 28 '13 at 08:31
0
  1. Base122 can give you 14% overhead (Write-up: http://blog.kevinalbs.com/base122 JS: https://github.com/kevinAlbs/Base122 Python: https://github.com/Theelx/pybase122)

  2. yenc + escaped js literal string will give you 1-2% overhead - https://github.com/eshaz/simple-yenc (see instructions regarding HTML here: https://github.com/eshaz/simple-yenc/issues/1#issuecomment-1066094764)

o17t H1H' S'k
0

Your direct question: Is there an algorithm more efficient than base64 to encode binary data within XML?

You don't define "efficient", but I think you mean: requires fewest bits of encoded representation per bit of binary data. You could also define "efficient" to include: shortest time to implement, shortest time to debug, lowest risk of bugs appearing in the code when in production, but apparently you do not include them.

I suggest you consider encoding the binary data with some form of Ascii85, then escaping any characters in the Ascii85-encoded form which have special XML meaning. This will typically take a little more than 1.25 bits of encoded representation per bit of binary data. It is somewhat better than base64, which requires a little more than 1.33 bits.

You have a choice of Ascii85 variants to use:

  • as defined in section ASCII85EncodeFilter of the Postscript Language Reference, 3rd edition, p. 131-132. This has the advantage of being well-defined and proven. But it uses characters <, &, and > in its encoded representation, which conflict with XML syntax.
  • as defined by the btoa and atob utilities, e.g. as posted by Joe Orost to comp.compression in 1991. This has similar advantages and disadvantages to the PostScript language ASCII85.
  • Any other form mentioned in the Wikipedia Ascii85 article.
  • You could define an encoding which avoids the characters <, &, and > (which conflict with XML syntax) in its encoded representation. This has the advantage of not requiring any special characters to be escaped for XML syntax, and the disadvantages of requiring more development and perhaps being more prone to bugs in the field.

Whichever Ascii85 form you use, modify the encoded text to avoid conflicts with XML syntax. There are two main options:

  1. run the encoded text through an XML syntax escaping function. This will replace characters <, &, and > in the encoded representation with &lt;, &amp;, and &gt; respectively. This gives you 3-4 bytes of overhead per character replaced.

  2. Wrap the encoded text in <![CDATA[...]]> syntax, and escape any occurrence of ]]> in the encoded text as ]]]]><![CDATA[>. This gives you 12 bytes of constant overhead, and 12 bytes of overhead for each occurrence of ]]> in the encoded representation.

In either case, the number of replacements depends on the specifics of your binary data. You could even choose which kind of escaping to use based on which has the shorter XML-compatible encoded representation. Both forms have identical XML semantics.

The result is encoded text which can be used as the text content of an XML element. It is compatible with XML syntax, and it is transparent to whitespace addition and removal.
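For example, with Python's standard library (a sketch using base64.a85encode, which implements one of the Ascii85 variants):

    import base64
    from xml.sax.saxutils import escape

    data = b"\x00\x01 some binary payload \xff"
    a85 = base64.a85encode(data).decode('ascii')

    # Option 1: escape the XML-special characters (& < >)
    option1 = escape(a85)

    # Option 2: wrap in CDATA and split any "]]>" inside the encoded text
    option2 = "<![CDATA[" + a85.replace("]]>", "]]]]><![CDATA[>") + "]]>"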

Jim DeLaHunt