5

In an arbitrary-sized array of bytes in C, I want to store 14-bit numbers (0-16,383) tightly packed. In other words, in the sequence:

0000000000000100000000000001

there are two numbers that I wish to be able to arbitrarily store and retrieve into a 16-bit integer. (in this case, both of them are 1, but could be anything in the given range) If I were to have the functions uint16_t 14bitarr_get(unsigned char* arr, unsigned int index) and void 14bitarr_set(unsigned char* arr, unsigned int index, uint16_t value), how would I implement those functions?

This is not for a homework project, merely my own curiosity. I have a specific project that this would be used for, and it is the key/center of the entire project.

I do not want an array of structs that have 14-bit values in them, as that generates waste bits for every struct that is stored. I want to be able to tightly pack as many 14-bit values as I possibly can into an array of bytes. (e.g.: in a comment I made, putting as many 14-bit values into a chunk of 64 bytes is desirable, with no waste bits. the way those 64 bytes work is completely tightly packed for a specific use case, such that even a single bit of waste would take away the ability to store another 14 bit value)

Robert Harvey
  • 178,213
  • 47
  • 333
  • 501
Freezerburn
  • 1,013
  • 3
  • 11
  • 29
  • 1
    The technique you are describing is called "packing bits" or "bit packing." If you know that, finding information on how to do it is much easier. In particular, http://www.catb.org/esr/structure-packing/ – Robert Harvey Oct 17 '15 at 02:23
  • @RobertHarvey - The article linked to seems mostly about ordering of structure members to avoid padding. – rcgldr Oct 17 '15 at 02:31
  • It's about word alignment. You're going to need to know about that. There are plenty of other articles I didn't link. – Robert Harvey Oct 17 '15 at 02:33
  • I don't think two 14-bit numbers will fit into a 16-bit integer. It should be array of 16-bit integers. – MikeCAT Oct 17 '15 at 02:35
  • @MikeCAT: It is. Read the question again. – Robert Harvey Oct 17 '15 at 02:41
  • Definitely an interesting problem. But if you're doing this to save memory, think about how many more bytes of memory will have to be devoted to code to save those two bits of RAM from simply using `int64_t` or `uint16_t`. – Andrew Henle Oct 17 '15 at 04:36
  • @AndrewHenle: While you might be right, my basic idea is to limit the program's total working memory used to 20MB, where I have a 64 byte chunk of memory describing offsets into chunks of memory where a total of 16,000 items can be in those chunks, with 8 bytes reserved in the 64 bytes for a 64 bit "user data" pointer. This gives 32 "pointers" to items in a chunk of memory that can fit into a CPU L1 cache line, with theoretically zero malloc calls, and easy cache-friendly iteration with almost no memory being consumed by the program. (not including bytes spent on compiled code) – Freezerburn Oct 17 '15 at 04:45

5 Answers5

4

Well, this is bit fiddling at its best. Doing it with an array of bytes makes it more complicated than it would be with larger elements because a single 14 bit quantity can span 3 bytes, where uint16_t or anything bigger would require no more than two. But I'll take you at your word that this is what you want (no pun intended). This code will actually work with the constant set to anything 8 or larger (but not over the size of an int; for that, additional type casts are needed). Of course the value type must be adjusted if larger than 16.

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#define W 14

uint16_t arr_get(unsigned char* arr, size_t index) {
  size_t bit_index = W * index;
  size_t byte_index = bit_index / 8;
  unsigned bit_in_byte_index = bit_index % 8;
  uint16_t result = arr[byte_index] >> bit_in_byte_index;
  for (unsigned n_bits = 8 - bit_in_byte_index; n_bits < W; n_bits += 8)
    result |= arr[++byte_index] << n_bits;
  return result & ~(~0u << W);
}

void arr_set(unsigned char* arr, size_t index, uint16_t value) {
  size_t bit_index = W * index;
  size_t byte_index = bit_index / 8;
  unsigned bit_in_byte_index = bit_index % 8;
  arr[byte_index] &= ~(0xff << bit_in_byte_index);
  arr[byte_index++] |= value << bit_in_byte_index;
  unsigned n_bits = 8 - bit_in_byte_index;
  value >>= n_bits;
  while (n_bits < W - 8) {
    arr[byte_index++] = value;
    value >>= 8;
    n_bits += 8;
  }
  arr[byte_index] &= 0xff << (W - n_bits);
  arr[byte_index] |= value;
}

int main(void) {
  int mod = 1 << W;
  int n = 50000;
  unsigned x[n];
  unsigned char b[2 * n];
  for (int tries = 0; tries < 10000; tries++) {
    for (int i = 0; i < n; i++) {
      x[i] = rand() % mod;
      arr_set(b, i, x[i]);
    }
    for (int i = 0; i < n; i++)
      if (arr_get(b, i) != x[i])
        printf("Err @%d: %d should be %d\n", i, arr_get(b, i), x[i]);
  }
  return 0;
}

Faster versions Since you said in comments that performance is an issue: open coding the loops gives a roughly 10% speed improvement on my machine on the little test driver included in the original. This includes random number generation and testing, so perhaps the primitives are 20% faster. I'm confident that 16- or 32-bit array elements would give further improvements because byte access is expensive:

uint16_t arr_get(unsigned char* a, size_t i) {
  size_t ib = 14 * i;
  size_t iy = ib / 8;
  switch (ib % 8) {
  case 0:
    return (a[iy] | (a[iy+1] << 8)) & 0x3fff;
  case 2:
    return ((a[iy] >> 2) | (a[iy+1] << 6)) & 0x3fff;
  case 4:
    return ((a[iy] >> 4) | (a[iy+1] << 4) | (a[iy+2] << 12)) & 0x3fff;
  }
  return ((a[iy] >> 6) | (a[iy+1] << 2) | (a[iy+2] << 10)) & 0x3fff;
}

#define M(IB) (~0u << (IB))
#define SETLO(IY, IB, V) a[IY] = (a[IY] & M(IB)) | ((V) >> (14 - (IB)))
#define SETHI(IY, IB, V) a[IY] = (a[IY] & ~M(IB)) | ((V) << (IB))

void arr_set(unsigned char* a, size_t i, uint16_t val) {
  size_t ib = 14 * i;
  size_t iy = ib / 8;
  switch (ib % 8) {
  case 0:
    a[iy] = val;
    SETLO(iy+1, 6, val);
    return;
  case 2:
    SETHI(iy, 2, val);
    a[iy+1] = val >> 6;
    return;
  case 4:
    SETHI(iy, 4, val);
    a[iy+1] = val >> 4;
    SETLO(iy+2, 2, val);
    return;
  }
  SETHI(iy, 6, val);
  a[iy+1] = val >> 2;
  SETLO(iy+2, 4, val);
}

Another variation This is quite a bit faster yet on my machine, about 20% better than above:

uint16_t arr_get2(unsigned char* a, size_t i) {
  size_t ib = i * 14;
  size_t iy = ib / 8;
  unsigned buf = a[iy] | (a[iy+1] << 8) | (a[iy+2] << 16);
  return (buf >> (ib % 8)) & 0x3fff;
}

void arr_set2(unsigned char* a, size_t i, unsigned val) {
  size_t ib = i * 14;
  size_t iy = ib / 8;
  unsigned buf = a[iy] | (a[iy+1] << 8) | (a[iy+2] << 16);
  unsigned io = ib % 8;
  buf = (buf & ~(0x3fff << io)) | (val << io);
  a[iy] = buf;
  a[iy+1] = buf >> 8;
  a[iy+2] = buf >> 16;
}

Note that for this code to be safe you should allocate one extra byte at the end of the packed array. It always reads and writes 3 bytes even when the desired 14 bits are in the first 2.

One more variation Finally, this runs just a bit slower than the one above (again on my machine; YMMV), but you don't need the extra byte. It uses one comparison per operation:

uint16_t arr_get2(unsigned char* a, size_t i) {
  size_t ib = i * 14;
  size_t iy = ib / 8;
  unsigned io = ib % 8;
  unsigned buf = ib % 8 <= 2
      ? a[iy] | (a[iy+1] << 8)
      : a[iy] | (a[iy+1] << 8) | (a[iy+2] << 16);
  return (buf >> io) & 0x3fff;
}

void arr_set2(unsigned char* a, size_t i, unsigned val) {
  size_t ib = i * 14;
  size_t iy = ib / 8;
  unsigned io = ib % 8;
  if (io <= 2) {
    unsigned buf = a[iy] | (a[iy+1] << 8);
    buf = (buf & ~(0x3fff << io)) | (val << io);
    a[iy] = buf;
    a[iy+1] = buf >> 8;
  } else {
    unsigned buf = a[iy] | (a[iy+1] << 8) | (a[iy+2] << 16);
    buf = (buf & ~(0x3fff << io)) | (val << io);
    a[iy] = buf;
    a[iy+1] = buf >> 8;
    a[iy+2] = buf >> 16;
  }
}
Gene
  • 46,253
  • 4
  • 58
  • 96
  • I'd be interested in seeing a version that operates on an array of uint16_t. But as it is, this looks to be the best solution for my purposes as it seems to be the fastest solution. (though I wonder if operating on an array of uint16_t would be even faster) – Freezerburn Oct 17 '15 at 05:44
  • @Freezerburn You didn't mention that speed was important. There are probably somewhat faster ways (wild guess 10 to 50%) to code a 14-bit in byte custom solution. Here I was trying for generality. – Gene Oct 17 '15 at 14:11
  • Ah, sorry about that. Do you know of any resources that I could use to build a faster solution, if it becomes necessary? (as it is, under -O3, set takes ~11 nanoseconds and get is ~5 nanoseconds if my timing is correct, considering microbenchmarks are good at lying. this should be plenty for my purposes at least for now) – Freezerburn Oct 17 '15 at 16:09
  • As mentioned before, the switch / case with fixed instruction sequences improves the performance. The example in my answer wasn't fully optimized (uses post increment instead of index + 1), but gives the idea. Array data could be read or written 32 bits at a time, but since much of the time it won't be aligned, I'm not sure that would help much with performance. – rcgldr Oct 17 '15 at 23:57
  • @Freezerburn I added another variation that is 20% faster still on my machine. It doesn't branch at all. – Gene Oct 18 '15 at 02:07
  • @rcgldr I only saw your remark and code after I posted. Great minds... ;-) – Gene Oct 18 '15 at 02:08
  • @Gene I have tried your code (1st version) to extract 37 bit . I have made necessary modifications for that. like W to 37 , data type of return value of arr_get,argument data type of arr_set etc... But its not working properly – jjm Jan 19 '17 at 05:41
  • @JissJ Works fine for me. The code wasn't intended to deal with `W>32`. You probably didn't fix this line: `result |= arr[++byte_index] << n_bits;`. It will fail if `W > 32` because the shift operation is only 32 bits (on most machines). Extra bits are lost. Try `result |= ((uint64_t) arr[++byte_index]) << n_bits;` – Gene Jan 19 '17 at 08:18
  • @Gene OK.. Understood. Thanks – jjm Jan 19 '17 at 12:08
2

The easiest solution is to use a struct of eight bitfields:

typedef struct __attribute__((__packed__)) EightValues {
    uint16_t v0 : 14,
             v1 : 14,
             v2 : 14,
             v3 : 14,
             v4 : 14,
             v5 : 14,
             v6 : 14,
             v7 : 14;
} EightValues;

This struct has a size of 14*8 = 112 bits, which is 14 bytes (seven uint16_t). Now, all you need is to use the last three bits of the array index to select the right bitfield:

uint16_t 14bitarr_get(unsigned char* arr, unsigned int index) {
    EightValues* accessPointer = (EightValues*)arr;
    accessPointer += index >> 3;    //select the right structure in the array
    switch(index & 7) {    //use the last three bits of the index to access the right bitfield
        case 0: return accessPointer->v0;
        case 1: return accessPointer->v1;
        case 2: return accessPointer->v2;
        case 3: return accessPointer->v3;
        case 4: return accessPointer->v4;
        case 5: return accessPointer->v5;
        case 6: return accessPointer->v6;
        case 7: return accessPointer->v7;
    }
}

Your compiler will do the bit-fiddling for you.

cmaster - reinstate monica
  • 38,891
  • 9
  • 62
  • 106
  • Nice try, but this won't usually work by default because the overall structure typically gets extra padding to a word boundary (not guaranteed, but very very common). The safest approach is to expand the number of elements out to 16 (i.e. 14 words) as alignments are usually no stricter than that at the hardware level, even on 64-bit platforms (except when handling 64-bit values). – Donal Fellows Jul 07 '20 at 13:36
  • @DonalFellows The basic data type of the bitfields is `uint16_t` of which 7 will be allocated. As such, I assumed that the whole structure would be padded and aligned to a `uint16_t` boundary. But I agree that I may have been overconfident, the structure should be declared to be packed. I've added that now. – cmaster - reinstate monica Jul 07 '20 at 13:53
1

The Basis for Storage Issue

The biggest issue you are facing is the fundamental question of "What is my basis for storage going to be?" You know the basics, what you have available to you is char, short, int, etc... The smallest being 8-bits. No matter how you slice your storage scheme, it will ultimately have to rest in memory in a unit of memory based on this 8 bit per byte layout.

The only optimal, no bits wasted, memory allocation would be to declare an array of char in the least common multiple of 14-bits. It is the full 112-bits in this case (7-shorts or 14-chars). This may be the best option. Here, declaring an array of 7-shorts or 14-chars, would allow the exact storage of 8 14-bit values. Of course if you have no need for 8 of them, then it wouldn't be of much use anyway as it would waste more than the 4-bits lost on a single unsigned value.

Let me know if this is something you would like to further explore. If it is, I'm happy to help with the implementation.

Bitfield Struct

The comments regarding bitfield packing or bit packing are exactly what you need to do. This can involve a structure alone or in combination with a union, or by manually right/left shifting values directly as needed.

A short example applicable to your situation (if I understood correctly you want 2 14-bit areas in memory) would be:

#include <stdio.h>

typedef struct bitarr14 {
    unsigned n1 : 14,
             n2 : 14;
} bitarr14;

char *binstr (unsigned long n, size_t sz);

int main (void) {

    bitarr14 mybitfield;

    mybitfield.n1 = 1;
    mybitfield.n2 = 1;

    printf ("\n mybitfield in memory : %s\n\n", 
            binstr (*(unsigned *)&mybitfield, 28));

    return 0;
}

char *binstr (unsigned long n, size_t sz)
{
    static char s[64 + 1] = {0};
    char *p = s + 64;
    register size_t i = 0;

    for (i = 0; i < sz; i++) {
        p--;
        *p = (n >> i & 1) ? '1' : '0';
    }

    return p;
}

Output

$ ./bin/bitfield14

 mybitfield in memory : 0000000000000100000000000001

Note: the dereference of mybitfield for purposes of printing the value in memory breaks strict aliasing and it is intentional just for the purpose of the output example.

The beauty, and purpose for using a struct in the manner provided is it will allow direct access to each 14-bit part of the struct directly, without having to manually shift, etc.

David C. Rankin
  • 81,885
  • 6
  • 58
  • 85
  • I might have not made it clear what I was asking for: the ability to set/get an arbitrary 14 bits in an array. Unfortunately, this answer does not meet that need, as there is still waste bits (32-28=4) if I were to generate an array of them. If I were to shove as many of these into 64 bytes as I can, I would not want to waste 64 bits (which is 4 more possible 14-bit values). And yes, I do want to shove as many of these into 64 bytes as I can in the project I have in mind. – Freezerburn Oct 17 '15 at 03:00
  • 1
    When someone offers help, if you want more, then the first thing you don't want to do is bite the hand that is feeding you. You were not clear, let's go from there, The easiest way to accomplish what you want without any waste would then be 2 short functions to set and retrieve the bits directly. I'll get an example of that if you can learn to be a bit more diplomatic. – David C. Rankin Oct 17 '15 at 03:05
  • I am sincerely sorry if I sounded like I was being uncivil. I was attempting to clarify the original question (which I have edited into the question) while providing detail as to why your original answer did not fit the question. Unfortunately, text is a terrible medium for conveying tone :( I do appreciate the help, sincerely. – Freezerburn Oct 17 '15 at 03:11
  • That's OK, I'm pretty sure I understood what you meant, it could have probably been worded a bit better. You mention a `short`, but you seem to really want to avoid the waste of `2` bits in each short, that is going to make things a bit more involved. Give me a bit and I'll amended the answer. – David C. Rankin Oct 17 '15 at 03:15
1

Update - assuming you want big endian bit packing. This is code meant for a fixed size code word. It's based on code I've used for data compression algorithms. The switch case and fixed logic helps with performance.

typedef unsigned short uint16_t;

void bit14arr_set(unsigned char* arr, unsigned int index, uint16_t value)
{
unsigned int bitofs = (index*14)%8;
    arr += (index*14)/8;
    switch(bitofs){
        case  0:   /* bit offset == 0 */
            *arr++  = (unsigned char)(value >>  6);
            *arr   &= 0x03;
            *arr   |= (unsigned char)(value <<  2);
            break;
        case  2:   /* bit offset == 2 */
            *arr   &= 0xc0;
            *arr++ |= (unsigned char)(value >>  8);
            *arr    = (unsigned char)(value <<  0);
            break;
        case  4:   /* bit offset == 4 */
            *arr   &= 0xf0;
            *arr++ |= (unsigned char)(value >> 10);
            *arr++  = (unsigned char)(value >>  2);
            *arr   &= 0x3f;             
            *arr   |= (unsigned char)(value <<  6);
            break;
        case  6:   /* bit offset == 6 */
            *arr   &= 0xfc;
            *arr++ |= (unsigned char)(value >> 12);
            *arr++  = (unsigned char)(value >>  4);
            *arr   &= 0x0f;
            *arr   |= (unsigned char)(value <<  4);
            break;
    }
}

uint16_t bit14arr_get(unsigned char* arr, unsigned int index)
{
unsigned int bitofs = (index*14)%8;
unsigned short value;
    arr += (index*14)/8;
    switch(bitofs){
        case  0:   /* bit offset == 0 */
            value  = ((unsigned int)(*arr++)     ) <<  6;
            value |= ((unsigned int)(*arr  )     ) >>  2;
            break;
        case  2:   /* bit offset == 2 */
            value  = ((unsigned int)(*arr++)&0x3f) <<  8;
            value |= ((unsigned int)(*arr  )     ) >>  0;
            break;
        case  4:   /* bit offset == 4 */
            value  = ((unsigned int)(*arr++)&0x0f) << 10;
            value |= ((unsigned int)(*arr++)     ) <<  2;
            value |= ((unsigned int)(*arr  )     ) >>  6;
            break;
        case  6:   /* bit offset == 6 */
            value  = ((unsigned int)(*arr++)&0x03) << 12;
            value |= ((unsigned int)(*arr++)     ) <<  4;
            value |= ((unsigned int)(*arr  )     ) >>  4;
            break;
    }
    return value;
}
rcgldr
  • 27,407
  • 3
  • 36
  • 61
0

Here's my version (updated to fix bugs):

#define PACKWID        14                    // number of bits in packed number
#define PACKMSK    ((1 << PACKWID) - 1)

#ifndef ARCHBYTEALIGN
#define ARCHBYTEALIGN    1                // align to 1=bytes, 2=words
#endif
#define ARCHBITALIGN    (ARCHBYTEALIGN * 8)

typedef unsigned char byte;
typedef unsigned short u16;
typedef unsigned int u32;
typedef long long s64;

typedef u16 pcknum_t;                    // container for packed number
typedef u32 acc_t;                        // working accumulator

#ifndef ARYOFF
#define ARYOFF long
#endif
#define PRT(_val)    ((unsigned long) _val)
typedef unsigned ARYOFF aryoff_t;            // bit offset

// packary -- access array of packed numbers
// RETURNS: old value
extern inline pcknum_t
packary(byte *ary,aryoff_t idx,int setflg,pcknum_t newval)
// ary -- byte array pointer
// idx -- index into array (packed number relative)
// setflg -- 1=set new value, 0=just get old value
// newval -- new value to set (if setflg set)
{
    aryoff_t absbitoff;
    aryoff_t bytoff;
    aryoff_t absbitlhs;
    acc_t acc;
    acc_t nval;
    int shf;
    acc_t curmsk;
    pcknum_t oldval;

    // get the absolute bit number for the given array index
    absbitoff = idx * PACKWID;

    // get the byte offset of the lowest byte containing the number
    bytoff = absbitoff / ARCHBITALIGN;

    // get absolute bit offset of first containing byte
    absbitlhs = bytoff * ARCHBITALIGN;

    // get amount we need to shift things by:
    // (1) our accumulator
    // (2) values to set/get
    shf = absbitoff - absbitlhs;

#ifdef MODSHOW
    do {
        static int modshow;

        if (modshow > 50)
            break;
        ++modshow;

        printf("packary: MODSHOW idx=%ld shf=%d bytoff=%ld absbitlhs=%ld absbitoff=%ld\n",
            PRT(idx),shf,PRT(bytoff),PRT(absbitlhs),PRT(absbitoff));
    } while (0);
#endif

    // adjust array pointer to the portion we want (guaranteed to span)
    ary += bytoff * ARCHBYTEALIGN;

    // fetch the number + some other bits
    acc = *(acc_t *) ary;

    // get the old value
    oldval = (acc >> shf) & PACKMSK;

    // set the new value
    if (setflg) {
        // get shifted mask for packed number
        curmsk = PACKMSK << shf;

        // remove the old value
        acc &= ~curmsk;

        // ensure caller doesn't pass us a bad value
        nval = newval;
#if 0
        nval &= PACKMSK;
#endif
        nval <<= shf;

        // add in the value
        acc |= nval;

        *(acc_t *) ary = acc;
    }

    return oldval;
}

pcknum_t
int_get(byte *ary,aryoff_t idx)
{

    return packary(ary,idx,0,0);
}

void
int_set(byte *ary,aryoff_t idx,pcknum_t newval)
{

    packary(ary,idx,1,newval);
}

Here are benchmarks:

set:    354740751        7.095 -- gene
set:    203407176        4.068 -- rcgldr
set:    298946533        5.979 -- craig

get:    268574627        5.371 -- gene
get:    166839767        3.337 -- rcgldr
get:    207764612        4.155 -- craig
Craig Estey
  • 30,627
  • 4
  • 24
  • 48
  • This appears to be a little endian version of bit packing. The OP didn't mention if he wanted big endian or little endian bit packing. It also assumes 32 bit read / writes do not have to be aligned. – rcgldr Oct 17 '15 at 04:56
  • @rcgldr Yes. On a BE arch, after int fetch and before store, just add an endian swap on acc [left off for brevity]. But, BE really only makes sense if an arch is BE [CPU's don't have vacuum tubes, either:-)] (still no problem, because the array may only be accessed via the access function). Virtually all bigint packages do LE. I've written my own from scratch. I used to hate LE, until I compared it in detail--it makes everything much simpler. And, int fetches haven't needed to be aligned on most arches since the 80's. Even the venerable IBM/370 supported unaligned via the ICM inst. – Craig Estey Oct 17 '15 at 05:21
  • I was thinking of standard compression formats, most of which are big endian (BE). I do recall that the backup DAT tape drives used a little endian (LE) compression format, but just about everything else I'm aware of uses big endian format. As for alignment issues, the 68000 series and older ARM series needed aligned data. For others reading this, BE reads sequential data into the low part of a working register and shifts left to get codes, LE reads sequential data into the high part of a working register and shifts right. – rcgldr Oct 17 '15 at 07:12
  • @rcgldr fixed bugs and added word align. Two LEs: arch LE for cell (e.g. int) and LE of bigint vector. arch dictates cell. But, always use LE for vec. When mult n-digit num * m-digit num, you get (n+m) digit num. With vec LE, it's easy to extend vec size via realloc, etc. – Craig Estey Oct 17 '15 at 23:19