I want to convert a buffer of binary data in bytes into a buffer of sextets, where a sextet is a byte with the two most significant bits set to zero. I also want to do the reverse, i.e. convert a buffer of sextets back to bytes. As a test I am generating a byte buffer with a pseudo-random number generator that produces numbers between 0 and 255, using the built-in generator available in C, in order to simulate binary data. The details of the pseudo-random number generator and how good it is are of little importance; all that matters is that a stream of bytes with various values is generated. Eventually a binary file will be read instead.
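The generator itself is nothing special; a minimal sketch of the idea (any rand()-based filler would serve) is:

#include <stdlib.h>

// Fill buf with len pseudo-random bytes in the range 0..255 to simulate binary data.
void fillRandomBytes(unsigned char* buf, int len) {
    for (int i = 0; i < len; i++) {
        buf[i] = (unsigned char)(rand() % 256);
    }
}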
I've modified the functions in the link How do I base64 encode (decode) in C? so that instead of encoding bytes to base64 characters and then decoding them back to bytes, plain sextets are used in place of base64. My encoding function is as follows:
int bytesToSextets(int inx, int iny, int numBytes, CBYTE* byteData, BYTE* sextetData) {
    static int modTable[] = { 0, 2, 1 };
    int numSextets = 4 * ((numBytes + 2) / 3);
    int i, j;
    for (i = inx, j = iny; i < numBytes;) {
        BYTE byteA = i < numBytes ? byteData[i++] : 0;
        BYTE byteB = i < numBytes ? byteData[i++] : 0;
        BYTE byteC = i < numBytes ? byteData[i++] : 0;
        UINT triple = (byteA << 0x10) + (byteB << 0x08) + byteC;
        sextetData[j++] = (triple >> 18) & 0x3F;
        sextetData[j++] = (triple >> 12) & 0x3F;
        sextetData[j++] = (triple >> 6) & 0x3F;
        sextetData[j++] = triple & 0x3F;
    }
    for (int i = 0; i < modTable[numBytes % 3]; i++) {
        sextetData[numSextets - 1 - i] = 0;
    }
    return j - iny;
}
where inx is the index in the input byte buffer at which encoding starts, iny is the index in the output sextet buffer at which the first sextet is written, numBytes is the number of bytes to be encoded, and *byteData and *sextetData are the buffers to read from and write to, respectively. When there is padding, the last for-loop sets elements of sextetData to zero rather than to '=' as in the original code. Although zero bytes can be valid data, since the lengths of the buffers are known in advance I presume this is not a problem. The function returns the number of sextets written, which can be checked against 4 * ((numBytes + 2) / 3). The first few sextets of the output buffer encode the number of bytes of data encoded in the rest of the buffer, with the number of sextets given by that formula.
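For example, with numBytes = 10:

numSextets = 4 * ((10 + 2) / 3) = 4 * 4 = 16
modTable[10 % 3] = modTable[1] = 2    // two trailing sextets carry no data

so a 10-byte input produces 16 sextets, the last two of which are padding.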
The code for decoding sextets back to bytes is as follows:
int sextetsToBytes(int inx, int iny, int numBytes, CBYTE* sextetData, BYTE* byteData) {
    int numSextets = 4 * ((numBytes + 2) / 3);
    int padding = 0;
    if (sextetData[numSextets - 1 + inx] == 0) padding++;
    if (sextetData[numSextets - 2 + inx] == 0) padding++;
    int i, j;
    for (i = inx, j = iny; i < numSextets + inx;) {
        UINT sextetA = sextetData[i++];
        UINT sextetB = sextetData[i++];
        UINT sextetC = sextetData[i++];
        UINT sextetD = sextetData[i++];
        UINT triple = (sextetA << 18) + (sextetB << 12) + (sextetC << 6) + sextetD;
        if (j < numBytes) byteData[j++] = (triple >> 16) & 0xFF;
        if (j < numBytes) byteData[j++] = (triple >> 8) & 0xFF;
        if (j < numBytes) byteData[j++] = triple & 0xFF;
    }
    return j - iny - padding;
}
where, as before, inx and iny are the indices at which to start reading from and writing to the buffers, and numBytes is the number of bytes that will be in the output buffer, from which the number of input sextets is calculated. The length is recovered from the first few sextets written by bytesToSextets(), so inx is the position in the input sextet buffer where the actual conversion back to bytes starts. In the original function the number of sextets is given, and the number of bytes is calculated from it as numSextets / 4 * 3. As the byte count is already known here, that step is skipped, which should not make a difference. The last two arguments, *sextetData and *byteData, are the input and output buffers, respectively.
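The header handling is not shown here; just to illustrate the idea, a length could be packed into the leading sextets along these lines (these helpers are illustrative only, not my actual code, which stores other information in the header as well):

// Illustrative only: pack a 24-bit length into four sextets and read it back.
void writeLength(UINT len, BYTE* sextetData) {
    sextetData[0] = (len >> 18) & 0x3F;
    sextetData[1] = (len >> 12) & 0x3F;
    sextetData[2] = (len >> 6) & 0x3F;
    sextetData[3] = len & 0x3F;
}

UINT readLength(CBYTE* sextetData) {
    return ((UINT)sextetData[0] << 18) | ((UINT)sextetData[1] << 12)
         | ((UINT)sextetData[2] << 6) | (UINT)sextetData[3];
}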
An input buffer of bytes is created, converted to sextets, and then, as a test, converted back to bytes. The generated initial byte buffer is compared with the final byte buffer produced from the intermediate sextet buffer. When the length of the input buffer is a multiple of 3, the match is perfect and the final output buffer is exactly the same. However, if the number of bytes in the initial buffer is not a multiple of 3, the last 3 bytes in the final output buffer may not match the original bytes. This obviously has something to do with the padding when the number of bytes is not a multiple of 3, but I am unable to find the source of the problem. Incidentally, the return values from the two functions are always correct, even when the last few bytes do not match.
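The check itself is just a byte-for-byte comparison, roughly:

// Compare the original buffer with the round-tripped one and report mismatches
// (bufA and bufC as in the test below; needs stdio.h and string.h).
if (memcmp(bufA, bufC, lenBufA) == 0) {
    printf("buffers match\n");
} else {
    for (int k = 0; k < lenBufA; k++) {
        if (bufA[k] != bufC[k]) {
            printf("mismatch at byte %d: %02X != %02X\n", k, bufA[k], bufC[k]);
        }
    }
}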
In a header file I have the following typedefs:
typedef unsigned char BYTE;
typedef const unsigned char CBYTE;
typedef unsigned int UINT;
Although the main function is more complicated, in its simplest version it would have a form like:
// Allocate memory for bufA and bufB.
// Write the data length and other information into sextets 0 to 4 in bufB.
// Convert the bytes in bufA starting at index 0 to sextets in bufB starting at index 5.
int countSextets = bytesToSextets(0, 5, lenBufA, bufA, bufB);
// Allocate memory for bufC.
// Convert the sextets in bufB starting at index 5 back to bytes in bufC starting at index 0.
int countBytes = sextetsToBytes(5, 0, lenBufC, bufB, bufC);
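Filling in the allocations, the test is essentially this sketch (error checking and frees omitted; fillRandomBytes is the filler sketched above, and the 5 header sextets are as described):

int lenBufA = 1000;                          // length of the test data in bytes
int numSextets = 4 * ((lenBufA + 2) / 3);
BYTE* bufA = malloc(lenBufA);
BYTE* bufB = malloc(5 + numSextets);         // 5 header sextets plus the data sextets
fillRandomBytes(bufA, lenBufA);
// ... write lenBufA and other information into sextets 0 to 4 of bufB ...
int countSextets = bytesToSextets(0, 5, lenBufA, bufA, bufB);
int lenBufC = lenBufA;                       // in practice recovered from the header
BYTE* bufC = malloc(lenBufC);
int countBytes = sextetsToBytes(5, 0, lenBufC, bufB, bufC);
// compare bufA and bufC as above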
As I said, this all works correctly, except that when lenBufA is not a multiple of 3, the last 3 recovered bytes in bufC do not match those in bufA, even though the calculated buffer lengths are all correct.
Perhaps someone can kindly help throw some light on this.