Converting binary data in bytes to sextets and the reverse in C

Question

I want to convert a buffer of binary data in bytes into a buffer of sextets, where a sextet is a byte with the two most significant bits set to zero. I also want to do the reverse, i.e. convert a buffer of sextets back to bytes. As a test, I generate a buffer of bytes with C's built-in pseudo-random number generator, producing values between 0 and 255 to simulate binary data. The quality of the generator is of little importance; all that matters is that a stream of bytes with various values is generated. Eventually a binary file will be read.

I've modified the functions from the linked question, How do I base64 encode (decode) in C?, so that bytes are encoded to sextets rather than to base64 characters, and then decoded back to bytes. My encoding function is as follows:

int bytesToSextets(int inx, int iny, int numBytes, CBYTE* byteData, BYTE* sextetData) {
  static int modTable[] = { 0, 2, 1 };
  int numSextets = 4 * ((numBytes + 2) / 3);

  int i, j;
  for (i = inx, j = iny; i < numBytes;) {

    BYTE byteA = i < numBytes ? byteData[i++] : 0;
    BYTE byteB = i < numBytes ? byteData[i++] : 0;
    BYTE byteC = i < numBytes ? byteData[i++] : 0;

    UINT triple = (byteA << 0x10) + (byteB << 0x08) + byteC;

    sextetData[j++] = (triple >> 18) & 0x3F;
    sextetData[j++] = (triple >> 12) & 0x3F;
    sextetData[j++] = (triple >> 6) & 0x3F;
    sextetData[j++] = triple & 0x3F;

  }

  for (int i = 0; i < modTable[numBytes % 3]; i++) {
    sextetData[numSextets - 1 - i] = 0;
  }

  return j - iny;
}

where inx is the index in the input byte buffer at which encoding starts, iny is the index in the output sextet buffer at which the first sextet is written, numBytes is the number of bytes to be encoded, and *byteData and *sextetData are the buffers to read from and write to respectively. The last for-loop sets the padding elements of sextetData to zero, not to '=' as in the original code. Although zero can be valid data, the lengths of the buffers are known in advance, so I presume this is not a problem. The function returns the number of sextets written, which can be checked against 4 * ((numBytes + 2) / 3). The first few sextets of the output buffer encode the number of bytes of data encoded in the rest of the buffer, whose length in sextets is given by that formula.
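To make the size arithmetic concrete, here is a minimal sketch. The helper names sextetCount and paddingSextets are hypothetical and do not appear in the code above; paddingSextets is assumed to mirror what modTable[numBytes % 3] yields.

```c
#include <assert.h>

/* Number of sextets produced for numBytes input bytes:
   every group of up to 3 bytes becomes exactly 4 sextets. */
int sextetCount(int numBytes) {
  assert(numBytes >= 0);
  return 4 * ((numBytes + 2) / 3);
}

/* Number of trailing padding sextets: 0, 2 or 1 for
   numBytes % 3 == 0, 1 or 2 (the same values as modTable). */
int paddingSextets(int numBytes) {
  assert(numBytes >= 0);
  return (3 - numBytes % 3) % 3;
}
```

For example, 4 input bytes give 8 sextets, of which the last 2 are padding.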

The code for decoding sextets back to bytes is as follows:

int sextetsToBytes(int inx, int iny, int numBytes, CBYTE* sextetData, BYTE* byteData) {
  int numSextets = 4 * ((numBytes + 2) / 3);
  int padding = 0;

  if (sextetData[numSextets - 1 + inx] == 0) padding++;
  if (sextetData[numSextets - 2 + inx] == 0) padding++;

  int i, j;
  for (i = inx, j = iny; i < numSextets + inx;) {
    UINT sextetA = sextetData[i++];
    UINT sextetB = sextetData[i++];
    UINT sextetC = sextetData[i++];
    UINT sextetD = sextetData[i++];

    UINT triple = (sextetA << 18) + (sextetB << 12) + (sextetC << 6) + sextetD;

    if (j < numBytes) byteData[j++] = (triple >> 16) & 0xFF;
    if (j < numBytes) byteData[j++] = (triple >> 8) & 0xFF;
    if (j < numBytes) byteData[j++] = triple & 0xFF;
  }

  return j - iny - padding;
}

where, as before, inx and iny are the indices to start reading from and writing to, and numBytes is the number of bytes that will be in the output buffer, from which the number of input sextets is calculated. The data length is found from the first few sextets of the sextet buffer, so inx is the position in the input sextet buffer at which the actual conversion back to bytes starts. In the original function the number of sextets is given, from which the number of bytes is calculated using numSextets / 4 * 3. As the number of bytes is already known here, this calculation is omitted, which should not make a difference. The last two arguments, *sextetData and *byteData, are the input and output buffers respectively.

An input buffer of bytes is created, converted to sextets, then as a test converted back to bytes. The initial buffer is then compared with the output buffer recovered from the intermediate sextet buffer. When the length of the input buffer is a multiple of 3, the match is perfect and the final output buffer is exactly the same. However, if the number of bytes in the initial buffer is not a multiple of 3, the last 3 bytes in the final output buffer may not match the original bytes. This obviously has something to do with the padding when the number of bytes is not a multiple of 3, but I am unable to find the source of the problem. Incidentally, the return values from the two functions are always correct, even when the last few bytes do not match.

In a header file I have the following typedefs:

typedef unsigned char BYTE;
typedef const unsigned char CBYTE;
typedef unsigned int UINT;

Although the main function is more complicated, in its simplest version it would have a form like:

// Allocate memory for bufA and bufB.
// Write the data length and other information into sextets 0 to 4 in bufB.

// Convert the bytes in bufA starting at index 0 to sextets in bufB starting at index 5.
int countSextets = bytesToSextets(0, 5, lenBufA, bufA, bufB);

// Allocate memory for bufC.

// Convert the sextets in bufB starting at index 5 back to bytes in bufC starting at index 0.
int countBytes = sextetsToBytes(5, 0, lenBufC, bufB, bufC);

As I said, this all works correctly, except that when lenBufA is not a multiple of 3, the last 3 recovered bytes in bufC do not match those in bufA, although the calculated buffer lengths are all correct.

Perhaps someone can kindly help throw some light on this.

To be honest, seems to me like all you have to do is replace the encoding table of base64 with an array `{0, 1, 2, 3, ... 63}` and you got what you want. Unless I am misunderstanding what you want, because you aren't really explaining how you want your sextets to be encoded. You could simply replace that `encodingTable` in the question you linked and be done with it. Besides, why pass an extra starting index to the function? If you want to start from some index, then just pass `bufA + index` to the function instead of `bufA`, that's easier and straightforward. — Marco Bonelli, Aug 25 '22 at 12:16
@Marco has the easiest solution... Another consideration is that base64 uses "out of range" '=' to pad to modulo 4 minipackets. Have you considered "Big/Little Endian"-ness of the binary data? Your file will be, you say, binary data becoming binary data, but at 33% more storage space.... What's the point? — Fe2O3, Aug 25 '22 at 12:23
There was no reason to complicate the routines with offsets into the buffers, the `inx` and `iny` parameters. All that is necessary to write into a particular offset in a buffer is to add the offset to the buffer when calling the routine, as with passing `bufB + 5` instead of passing `bufB` and 5 separately. If you really want to have the caller pass the two values separately, then using `sextetData += iny;` at the start of the routine would have been simpler than threading `iny` into every use of `sextetData` throughout the routine. Simplifying the code that way would have avoided this bug. — Eric Postpischil, Aug 25 '22 at 12:26
In the future: Test each routine separately and narrow the problem down to one. Doing so would have revealed `bytesToSextets` is at fault, and you would not have needed to post the other routine. Test the parameters to each routine. Testing with `iny` set to zero would have revealed a problem with it, and then inspecting the code would quickly reveal the problem. Do elementary debugging before coming to Stack Overflow. — Eric Postpischil, Aug 25 '22 at 12:28

score 0 · Answer 1 · answered Aug 25 '22 at 12:22

sextetData[numSextets - 1 - i] = 0; should be sextetData[iny + numSextets - 1 - i] = 0;.
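A minimal sketch of the question's encoder with this one-line change applied is given below. The typedefs are copied from the question. The function encodeAtOffsetLooksRight and its test values are hypothetical, added only to illustrate the effect of the offset. Note that missing bytes in the last group are read as 0, so with correct indexing the padding loop only rewrites sextets that are already 0.

```c
#include <assert.h>
#include <string.h>

typedef unsigned char BYTE;
typedef const unsigned char CBYTE;
typedef unsigned int UINT;

/* The question's encoder with the padding loop indexed relative to iny. */
int bytesToSextets(int inx, int iny, int numBytes, CBYTE* byteData, BYTE* sextetData) {
  static const int modTable[] = { 0, 2, 1 };
  int numSextets = 4 * ((numBytes + 2) / 3);
  int i, j;

  assert(numBytes >= 0);

  for (i = inx, j = iny; i < numBytes;) {
    /* Bytes past the end of the input are treated as 0. */
    UINT byteA = i < numBytes ? byteData[i++] : 0;
    UINT byteB = i < numBytes ? byteData[i++] : 0;
    UINT byteC = i < numBytes ? byteData[i++] : 0;

    UINT triple = (byteA << 16) + (byteB << 8) + byteC;

    sextetData[j++] = (triple >> 18) & 0x3F;
    sextetData[j++] = (triple >> 12) & 0x3F;
    sextetData[j++] = (triple >> 6) & 0x3F;
    sextetData[j++] = triple & 0x3F;
  }

  /* Corrected index: without "iny +" this loop clears sextets that lie
     iny positions before the real padding, i.e. inside the data. */
  for (i = 0; i < modTable[numBytes % 3]; i++) {
    sextetData[iny + numSextets - 1 - i] = 0;
  }

  return j - iny;
}

/* Hypothetical check: encode 4 bytes of 0xFF at output offset 5.
   Returns 1 if the header area and all 8 sextets are as expected. */
int encodeAtOffsetLooksRight(void) {
  static const BYTE expect[8] = { 0x3F, 0x3F, 0x3F, 0x3F, 0x3F, 0x30, 0x00, 0x00 };
  CBYTE in[4] = { 0xFF, 0xFF, 0xFF, 0xFF };
  BYTE out[5 + 8];
  int n;

  memset(out, 0x2A, sizeof out);   /* marker value standing in for header sextets */
  n = bytesToSextets(0, 5, 4, in, out);

  return n == 8 && out[4] == 0x2A && memcmp(out + 5, expect, 8) == 0;
}
```

With the uncorrected index, the same call would clear out[6] and out[7], which hold data from the first group, so the check would fail.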

Eric Postpischil

Many thanks to Eric for your very quick reply. I was traveling or otherwise occupied and unable to reply for a few days. Making the change you suggested fixed the bug. However, I then changed the function call to include the offsets, so in the end the change was not necessary. – csharp Aug 28 '22 at 13:53

score 0 · Answer 2 · answered Aug 28 '22 at 14:14

The version of sextetsToBytes() I originally posted had the problem that I tested for padding by using:

if (sextetData[numSextets - 1 + inx] == 0) padding++;
if (sextetData[numSextets - 2 + inx] == 0) padding++;

Testing for '=' as in base64 is of course not possible here, but testing for zero can still cause problems, because zero can be a valid data item. This indeed sometimes caused a difference between the specified number of output bytes and the number found by counting the bytes in the loop and subtracting the padding. Removing the padding test from the function, and instead checking the returned count against the specified input value numBytes, works. The modified code is as follows:

int sextetsToBytes(int numBytes, CBYTE* sextetData, BYTE* byteData) {
  int numSextets = 4 * ((numBytes + 2) / 3);

  int i, j;
  for (i = 0, j = 0; i < numSextets;) {
    UINT sextetA = sextetData[i++];
    UINT sextetB = sextetData[i++];
    UINT sextetC = sextetData[i++];
    UINT sextetD = sextetData[i++];

    UINT triple = (sextetA << 18) + (sextetB << 12) + (sextetC << 6) + sextetD;

    if (j < numBytes) byteData[j++] = (triple >> 16) & 0xFF;
    if (j < numBytes) byteData[j++] = (triple >> 8) & 0xFF;
    if (j < numBytes) byteData[j++] = triple & 0xFF;
  }

  return j;
}
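
A round-trip check along the following lines can be used to confirm this for lengths that are not a multiple of 3. This is only a sketch: encodeSextets is a hypothetical offset-free encoder (offsets are applied by the caller with pointer arithmetic, as suggested in the comments), and roundTripOk, HEADER and MAX_LEN are illustrative names that are not part of my program.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef unsigned char BYTE;
typedef const unsigned char CBYTE;
typedef unsigned int UINT;

enum { HEADER = 5, MAX_LEN = 64 };   /* assumed header size and test limit */

/* Hypothetical offset-free encoder: the caller passes buf + offset. */
int encodeSextets(int numBytes, CBYTE* byteData, BYTE* sextetData) {
  int i = 0, j = 0;
  while (i < numBytes) {
    /* Missing bytes in the last group are read as 0,
       so the padding sextets come out as 0 without extra work. */
    UINT byteA = byteData[i++];
    UINT byteB = i < numBytes ? byteData[i++] : 0;
    UINT byteC = i < numBytes ? byteData[i++] : 0;
    UINT triple = (byteA << 16) | (byteB << 8) | byteC;

    sextetData[j++] = (triple >> 18) & 0x3F;
    sextetData[j++] = (triple >> 12) & 0x3F;
    sextetData[j++] = (triple >> 6) & 0x3F;
    sextetData[j++] = triple & 0x3F;
  }
  return j;
}

/* Decoder as posted above. */
int sextetsToBytes(int numBytes, CBYTE* sextetData, BYTE* byteData) {
  int numSextets = 4 * ((numBytes + 2) / 3);
  int i, j;
  for (i = 0, j = 0; i < numSextets;) {
    UINT sextetA = sextetData[i++];
    UINT sextetB = sextetData[i++];
    UINT sextetC = sextetData[i++];
    UINT sextetD = sextetData[i++];
    UINT triple = (sextetA << 18) + (sextetB << 12) + (sextetC << 6) + sextetD;

    if (j < numBytes) byteData[j++] = (triple >> 16) & 0xFF;
    if (j < numBytes) byteData[j++] = (triple >> 8) & 0xFF;
    if (j < numBytes) byteData[j++] = triple & 0xFF;
  }
  return j;
}

/* Returns 1 if len pseudo-random bytes survive the trip
   bytes -> sextets (stored after a HEADER-sized gap) -> bytes. */
int roundTripOk(int len) {
  BYTE bufA[MAX_LEN], bufC[MAX_LEN];
  BYTE bufB[HEADER + 4 * ((MAX_LEN + 2) / 3)];
  int k, countSextets, countBytes;

  assert(len >= 0 && len <= MAX_LEN);

  for (k = 0; k < len; k++) bufA[k] = (BYTE)(rand() & 0xFF);

  countSextets = encodeSextets(len, bufA, bufB + HEADER);
  countBytes = sextetsToBytes(len, bufB + HEADER, bufC);

  return countSextets == 4 * ((len + 2) / 3)
      && countBytes == len
      && memcmp(bufA, bufC, (size_t)len) == 0;
}
```

Calling roundTripOk for every length from 0 to MAX_LEN exercises all three values of len % 3.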
