How to read bitN integer data from a binary file?

Question

I have a data file generated by hardware. Some of the data is 4 bits wide and some is 12 bits wide. Matlab can process this data using fread(fp,1,'ubit4=>uint16'). I tried to do the same in C++, but there seems to be no simple way. I can read by byte/int/long/long long and then extract the requested bits, but that seems inefficient when dealing with hundreds of megabytes of data.

To generalize, the question is how to read a bitN integer (N from 1 to 64, for example). Can anyone recommend a good way to read this kind of data from a file in C++?

Nope, you'll have to write a custom function/class that can handle reading partial bytes. — Mooing Duck, Nov 09 '11 at 19:02
Internally, Matlab also just reads the bytes and interprets them (because the operating system doesn't actually allow anything else). So any inherent inefficiency of doing so is also happening in Matlab. — celtschk, Nov 09 '11 at 19:04

Mooing Duck · Answer 1 · 2011-11-09T20:30:00.283

#include <iostream>
#include <climits>
#include <stdexcept>
#include <cassert>

class bitbuffer {
    unsigned char buffer;    //bits not yet handed out (unsigned avoids sign extension)
    unsigned char held_bits; //number of valid low bits in buffer
public:
    bitbuffer() :buffer(0), held_bits(0) {}
    unsigned long long read(unsigned char bits) {
        unsigned long long result = 0;
        //if the buffer doesn't hold enough bits
        while (bits > held_bits) {
            //grab all the bits in the buffer
            bits -= held_bits;
            if (held_bits > 0) //skip when empty: avoids an undefined shift by 64
                result |= ((unsigned long long)buffer) << bits;
            //reload the buffer
            char next = 0;
            if (!std::cin.get(next))
                throw std::runtime_error("bitbuffer: unexpected end of input");
            buffer = static_cast<unsigned char>(next);
            held_bits = CHAR_BIT;
        }
        //append the bits left to the end of the result
        result |= buffer >> (held_bits-bits);
        //remove those bits from the buffer
        held_bits -= bits;
        buffer &= (1u<<held_bits)-1;
        return result;
    }
};

int main() {
    std::cout << "enter 65535: ";  
    bitbuffer reader;  //"65535\n" arrives as bytes 0x36 0x35 0x35 0x33 0x35 0x0A
    assert(reader.read(4) == 0x3);
    assert(reader.read(4) == 0x6);
    assert(reader.read(8) == 0x35);
    assert(reader.read(1) == 0x0);
    assert(reader.read(1) == 0x0);
    assert(reader.read(1) == 0x1);
    assert(reader.read(1) == 0x1);
    assert(reader.read(4) == 0x5);
    assert(reader.read(16) == 0x3335);
    assert(reader.read(8) == 0x0A);
    std::cout << "enter FFFFFFFF: ";
    assert(reader.read(64) == 0x4646464646464646);
    return 0;
}

Note that this reads from std::cin and throws a generic error if it fails, but it shouldn't be too hard to customize those parts depending on your needs.
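As one way to do the customization mentioned above, here is a minimal sketch of the same idea reading from any std::istream instead of std::cin. The class name stream_bit_reader is made up for illustration, and the sketch assumes 8-bit bytes and MSB-first bit order, as in the answer's test.

```cpp
#include <cassert>
#include <climits>
#include <cstdint>
#include <istream>
#include <sstream>   // only needed for the usage example
#include <stdexcept>
#include <string>    // only needed for the usage example

// Hypothetical helper (not part of the answer): MSB-first bit reader
// over an arbitrary std::istream.
class stream_bit_reader {
    std::istream& in_;
    unsigned char buffer_ = 0;  // last byte fetched from the stream
    unsigned held_bits_ = 0;    // how many low bits of buffer_ are still unread
public:
    explicit stream_bit_reader(std::istream& in) : in_(in) {}

    // Read `bits` (1..64) bits, most significant bit first.
    std::uint64_t read(unsigned bits) {
        assert(bits >= 1 && bits <= 64);
        std::uint64_t result = 0;
        while (bits > 0) {
            if (held_bits_ == 0) {  // refill one byte at a time
                char next = 0;
                if (!in_.get(next))
                    throw std::runtime_error("stream_bit_reader: unexpected end of input");
                buffer_ = static_cast<unsigned char>(next);
                held_bits_ = CHAR_BIT;
            }
            const unsigned take = bits < held_bits_ ? bits : held_bits_;
            // Extract the top `take` bits of the unread part of buffer_.
            const unsigned chunk = (buffer_ >> (held_bits_ - take)) & ((1u << take) - 1u);
            result = (result << take) | chunk;  // take <= 8, so the shift is always defined
            held_bits_ -= take;
            bits -= take;
        }
        return result;
    }
};
```

With a std::istringstream holding the three bytes 0xAB 0xCD 0xEF, two calls to read(12) would return 0xABC and 0xDEF. Opening a std::ifstream with std::ios::binary and passing it in should work the same way for a file.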

Thank you Mooing, this is pretty much what I want. Let me try it first and see if it works. — shangping, Nov 09 '11 at 19:48
@shangping: I'm testing it now. It's buggy. I'll let you know when I got it working right. — Mooing Duck, Nov 09 '11 at 20:15
@shangping: Got it, it now reads properly. The biggest problem was I was assuming a big-endian machine, fix'd by reading one byte at a time. — Mooing Duck, Nov 09 '11 at 20:30
Note that if the bitbuffer object is destructed, you will lose any unread bits in the last byte touched. — Mooing Duck, Nov 09 '11 at 20:42

Paolo Brandoli · Answer 2 · 2011-11-09T20:39:04.913

In my project I have the same requirements of reading N bits from a stream.

The source code is available here: https://bitbucket.org/puntoexe/imebra/src/6a3d67b378c8/project_files/library/base

Or you can download the entire package with documentation from https://bitbucket.org/puntoexe/imebra/downloads and use only the baseClasses. It's open source (FreeBSD) and tested. No other libraries are necessary apart from the STL.

Basically, you create a stream and then connect a streamReader to it. The stream reader is able to read blocks of bytes or the requested amount of bits. Several streamReader objects can be connected to the same stream.

The classes are currently used to read jpeg files or medical image files.

Works on several operating systems (including iOS), on both big- and little-endian machines.

Example:

#include "../../library/imebra/include/imebra.h"

// Open the file containing the dicom dataset
ptr<puntoexe::stream> inputStream(new puntoexe::stream);
inputStream->openFile(argv[1], std::ios_base::in);

// Connect a stream reader to the dicom stream. Several stream reader
//  can share the same stream
ptr<puntoexe::streamReader> reader(new streamReader(inputStream));

score 1 · Accepted Answer · answered Nov 10 '11 at 18:43

Thank you all for the answers; they are all very helpful. I am not trying to answer my own question to get the credit, but I feel obligated to report my progress on it. All credit goes to the answers above.

To get something similar to Matlab's fread for bitN integers, I feel that a template class is not the right fit, so I came up with several functions that handle the <8bit, <16bit, <32bit and <64bit cases separately.

My idea is to copy several bytes (from 2 to 8) into my object, process them, and keep the unprocessed bytes for the next round. Here is my code and the test results (only the <8bit case is implemented):

#include <cmath>
#include <cstring>
#include <cstdint>
typedef std::uint8_t _uint8;   //portable replacements for the MSVC-only sized integer types
typedef std::uint16_t _uint16;
typedef std::uint32_t _uint32;
typedef std::uint64_t _uint64;

class bitbuffer
{
    _uint8 *pbuf;
    _uint8 *pelem; //can be casted to int16/32/64
    _uint32 pbuf_len; //buf length in byte
    _uint32 pelem_len; //element length in byte
    union membuf
    {
        _uint64 buf64;
        _uint32 buf32;
        _uint16 buf16;
        _uint8 buf8[2];
    } tbuf;

    //bookkeeping information
    _uint8 start_bit; //
    _uint32 byte_pos; //current byte position
    _uint32 elem_pos;
public:
    bitbuffer(_uint8 *src,_uint32 src_len,_uint8 *dst,_uint32 dst_len)
    {
        pbuf=src;pelem=dst;
        pbuf_len=src_len;pelem_len=dst_len;
        start_bit=0;byte_pos=0;elem_pos=0;
    } //to define the source and destination
    void set_startbit(_uint8 bit) {start_bit=bit;}
    void set_bytepos(_uint32 pos) {byte_pos=pos;}
    void set_elempos(_uint32 pos) {elem_pos=pos;}
    void reset() {start_bit=0;byte_pos=0;elem_pos=0;} //for restart something from somewhere else
    //OUT getbits(IN a, _uint8 nbits); //get nbits from a using start and byte_pos
    _uint32 get_elem_uint8(_uint32 num_elem,_uint8 nbits) //output limit to 8/16/32/64 only
    {
        _uint32 num_read=0;
        _uint16 mask=(_uint16)((1u<<nbits)-1);//00000111 for example nbit=3 
        while(byte_pos<=pbuf_len-2)
        {
            //memcpy((char*)&tbuf.buf16,pbuf+byte_pos,2); //copy 2 bytes into our buffer, this may introduce redundant copy
            //combine two bytes MSB-first; this does not depend on host endianness
            tbuf.buf16=(_uint16)((pbuf[byte_pos]<<8)|pbuf[byte_pos+1]);
            //now we have start_bits, byte_pos, elem_pos, just finish them all
            while(start_bit<=16-nbits)
            {
                pelem[elem_pos++]=(tbuf.buf16>>(16-start_bit-nbits))&mask;//(tbuf.buf16&(mask<<(16-start_bit))
                start_bit+=nbits; //advance by nbits
                num_read++;
                if(num_read>=num_elem)
                {
                    break;
                }
            }
            //need update the start_bit and byte_pos
            byte_pos+=(start_bit/8);
            start_bit%=8;
            if(num_read>=num_elem)
            {
                break;
            }

        }
        return num_read;
    }
/*  
    _uint32 get_elem_uint16(_uint32 num_elem,_uint8 nbits) //output limit to 8/16/32/64 only
    {
        _uint32 num_read=0;
        _uint32 mask=pow(2,nbits)-1;//00000111 for example nbit=3 
        while(byte_pos<pbuf_len-4)
        {
            memcpy((char*)&tbuf.buf32,pbuf+byte_pos,4); //copy 2 bytes into our buffer, this may introduce redundant copy
            //now we have start_bits, byte_pos, elem_pos, just finish them all
            while(start_bit<=32-nbits)
            {
                pelem[elem_pos++]=(tbuf.buf32>>(32-start_bit-nbits))&mask;//(tbuf.buf16&(mask<<(16-start_bit))
                start_bit+=nbits; //advance by nbits
                num_read++;
                if(num_read>=num_elem)
                {
                    break;
                }
            }
            //need update the start_bit and byte_pos
            start_bit%=8;
            byte_pos+=(start_bit/8);
            if(num_read>=num_elem)
            {
                break;
            }

        }
        return num_read;
    }
    _uint32 get_elem_uint32(_uint32 num_elem,_uint8 nbits) //output limit to 8/16/32/64 only
    {
        _uint32 num_read=0;
        _uint64 mask=pow(2,nbits)-1;//00000111 for example nbit=3 
        while(byte_pos<pbuf_len-8)
        {
            memcpy((char*)&tbuf.buf16,pbuf+byte_pos,8); //copy 2 bytes into our buffer, this may introduce redundant copy
            //now we have start_bits, byte_pos, elem_pos, just finish them all
            while(start_bit<=64-nbits)
            {
                pelem[elem_pos++]=(tbuf.buf64>>(64-start_bit-nbits))&mask;//(tbuf.buf16&(mask<<(16-start_bit))
                start_bit+=nbits; //advance by nbits
                num_read++;
                if(num_read>=num_elem)
                {
                    break;
                }
            }
            //need update the start_bit and byte_pos
            start_bit%=8;
            byte_pos+=(start_bit/8);
            if(num_read>=num_elem)
            {
                break;
            }

        }
        return num_read;
    }

    //not work well for 64 bit!
    _uint64 get_elem_uint64(_uint32 num_elem,_uint8 nbits) //output limit to 8/16/32/64 only
    {
        _uint32 num_read=0;
        _uint64 mask=pow(2,nbits)-1;//00000111 for example nbit=3 
        while(byte_pos<pbuf_len-2)
        {
            memcpy((char*)&tbuf.buf16,pbuf+byte_pos,8); //copy 2 bytes into our buffer, this may introduce redundant copy
            //now we have start_bits, byte_pos, elem_pos, just finish them all
            while(start_bit<=16-nbits)
            {
                pelem[elem_pos++]=(tbuf.buf16>>(16-start_bit-nbits))&mask;//(tbuf.buf16&(mask<<(16-start_bit))
                start_bit+=nbits; //advance by nbits
                num_read++;
                if(num_read>=num_elem)
                {
                    break;
                }
            }
            //need update the start_bit and byte_pos
            start_bit%=8;
            byte_pos+=(start_bit/8);
            if(num_read>=num_elem)
            {
                break;
            }

        }
        return num_read;
    }*/
};

#include <iostream>
using namespace std;

int main()
{
    _uint8 *pbuf=new _uint8[10];
    _uint8 *pelem=new _uint8[80];
    int i; //declared once; the loops below reuse it
    for(i=0;i<10;i++) pbuf[i]=i*11+11;
    bitbuffer vbit(pbuf,10,pelem,10);

    cout.setf(ios_base::hex,ios_base::basefield);
    cout<<"Bytes: ";
    for(i=0;i<10;i++) cout<<(int)pbuf[i]<<" "; //cast so the byte prints as a number, not a character
    cout<<endl;
    cout<<"1 bit: ";
    int num_read=vbit.get_elem_uint8(80,1);
    for(i=0;i<num_read;i++) cout<<(int)pelem[i];
    cout<<endl;
    vbit.reset();
    cout<<"2 bit: ";
    num_read=vbit.get_elem_uint8(40,2);
    for(i=0;i<num_read;i++) cout<<(int)pelem[i]<<" ";
    cout<<endl;
    vbit.reset();
    cout<<"3 bit: ";
    num_read=vbit.get_elem_uint8(26,3);
    for(i=0;i<num_read;i++) cout<<(int)pelem[i]<<' ';
    cout<<endl;
    vbit.reset();
    cout<<"4 bit: ";
    num_read=vbit.get_elem_uint8(20,4);//request 20 4-bit integers
    for(i=0;i<num_read;i++) cout<<(int)pelem[i]<<" ";
    cout<<endl;
    vbit.reset();
    cout<<"5 bit: ";
    num_read=vbit.get_elem_uint8(16,5);//request 16 5-bit integers
    for(i=0;i<num_read;i++) cout<<(int)pelem[i]<<" ";
    cout<<endl;
    vbit.reset();
    cout<<"6 bit: ";
    num_read=vbit.get_elem_uint8(13,6);//request 13 6-bit integers
    for(i=0;i<num_read;i++) cout<<(int)pelem[i]<<" ";
    cout<<endl;
    vbit.reset();
    cout<<"7 bit: ";
    num_read=vbit.get_elem_uint8(11,7);//request 11 7-bit integers
    for(i=0;i<num_read;i++) cout<<(int)pelem[i]<<" ";
    cout<<endl;
    vbit.reset();
    cout<<"8 bit: ";
    num_read=vbit.get_elem_uint8(10,8);//request 10 8-bit integers
    for(i=0;i<num_read;i++) cout<<(int)pelem[i]<<" ";
    cout<<endl;
    vbit.reset();

    return 0;
}

testing results:

Bytes: b 16 21 2c 37 42 4d 58 63 6e
1 bit: 00001011000101100010000100101100001101110100001001001101010110000110001101101110
2 bit: 0 0 2 3 0 1 1 2 0 2 0 1 0 2 3 0 0 3 1 3 1 0 0 2 1 0 3 1 1 1 2 0 1 2 0 3 1 2 3 2
3 bit: 0 2 6 1 3 0 4 1 1 3 0 3 3 5 0 2 2 3 2 5 4 1 4 3
4 bit: 0 b 1 6 2 1 2 c 3 7 4 2 4 d 5 8 6 3 6 e
5 bit: 1 c b 2 2 b 1 17 8 9 6 15 10 18 1b e
6 bit: 2 31 18 21 b 3 1d 2 13 15 21 23
7 bit: 5 45 44 12 61 5d 4 4d 2c 18 6d
8 bit: b 16 21 2c 37 42 4d 58 63 6e
Press any key to continue
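The separate 8/16/32/64-bit variants might be folded into a single routine by keeping the pending bits in a 64-bit accumulator. The following is only a sketch under that assumption: unpack_bits is a hypothetical name, fields are taken MSB-first as in the results above, and the width is limited to 32 bits to keep the accumulator arithmetic simple.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: unpack consecutive nbits-wide, MSB-first unsigned
// fields from src[0..src_len). Trailing bits that do not form a complete
// field are dropped.
std::vector<std::uint32_t> unpack_bits(const std::uint8_t* src,
                                       std::size_t src_len,
                                       unsigned nbits) {
    assert(nbits >= 1 && nbits <= 32);
    std::vector<std::uint32_t> out;
    out.reserve(src_len * 8 / nbits);

    std::uint64_t acc = 0;  // the low `held` bits are still waiting to be emitted
    unsigned held = 0;
    const std::uint64_t mask = (std::uint64_t(1) << nbits) - 1;

    for (std::size_t i = 0; i < src_len; ++i) {
        acc = (acc << 8) | src[i];  // append the next byte below the pending bits
        held += 8;
        while (held >= nbits) {     // emit every complete field
            out.push_back(static_cast<std::uint32_t>((acc >> (held - nbits)) & mask));
            held -= nbits;
        }
    }
    return out;
}
```

For the ten test bytes above, a 4-bit unpack would give the same 20 values as the listing. Because this version does not need a two-byte window, a 3-bit unpack would return 26 values instead of 24, since the last byte is consumed as well.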

Kerrek SB · Answer 4 · 2011-11-09T19:33:26.337

Your question isn't very specific, so I can only recommend some general ideas.

You might like to read the file in chunks, for example 4096 bytes at a time (that's the typical page size) -- though larger chunks should also be fine (maybe 64kiB or 512kiB even, just experiment). Once you got a chunk read, process it from memory.

To be correct, we should generate the chunk memory as an array of the target integer. For example, for 4-byte integers we could do this:

#include <cstdint>
#include <memory>
#include <cstdio>

uint32_t buf[1024];

typedef std::unique_ptr<std::FILE, int (*)(std::FILE *)> unique_file_ptr;

static unique_file_ptr make_file(const char * filename, const char * flags)
{
  std::FILE * const fp = std::fopen(filename, flags);
  return unique_file_ptr(fp ? fp : nullptr, std::fclose);
}

int main()
{
  auto fp = make_file("thedata.bin", "rb");

  if (!fp) return 1;

  while (true)
  {
    if (4096 != std::fread(buf, 1, 4096, fp.get())) break;
    // process buf[0] up to buf[1023]
  }
}

I chose the C-library fopen/fread over C++ iostreams for performance reasons; I cannot actually claim that that decision is based on personal experience. (If you have an old compiler, you might need the header <stdint.h> instead, and perhaps you don't have unique_ptr, in which case you can just use std::FILE* and std::fopen manually.)

Alternative to the global buf, you could also make an std::vector<uint32_t>, resize it to something large enough and read into its data buffer directly (&buf[0] or buf.data()).
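A minimal sketch of the vector-based variant follows. It uses a std::vector<unsigned char> rather than uint32_t, on the assumption that the bytes will be unpacked manually afterwards; read_chunk is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical helper: read up to chunk_size bytes from fp into buf.
// buf is resized to the number of bytes actually read, which is also
// returned (0 means end of file or a read error).
std::size_t read_chunk(std::FILE* fp,
                       std::vector<unsigned char>& buf,
                       std::size_t chunk_size) {
    assert(fp != nullptr && chunk_size > 0);
    buf.resize(chunk_size);
    const std::size_t got = std::fread(buf.data(), 1, chunk_size, fp);
    buf.resize(got);  // a short final chunk shrinks the vector
    return got;
}
```

A caller would loop with something like while (read_chunk(fp, buf, 65536) > 0) and process buf inside the loop. Unlike the fixed 4096-byte comparison above, this also hands back a short final chunk instead of discarding it.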

If you need to read integers that are not 2, 4, 8 or 16 bytes long, you'll have to read into an unsigned char array (a plain char may be signed, which breaks the shifts for bytes of 0x80 and above) and extract the numbers manually with arithmetic operations (e.g. buf[pos] + (buf[pos + 1] << 8) + (buf[pos + 2] << 16) for a 3-byte little-endian integer). If your packing isn't even byte-aligned, you'll have to make an even greater effort.
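To make both cases concrete, here is a small sketch. The function names are invented for illustration, and the 12-bit layout (two values packed MSB-first into every three bytes) is an assumption; the hardware in the question may pack its data differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// 3-byte little-endian unsigned integer starting at buf[pos].
std::uint32_t read_u24_le(const unsigned char* buf, std::size_t pos) {
    return static_cast<std::uint32_t>(buf[pos])
         | (static_cast<std::uint32_t>(buf[pos + 1]) << 8)
         | (static_cast<std::uint32_t>(buf[pos + 2]) << 16);
}

// index-th 12-bit value, ASSUMING two values are packed MSB-first into
// each group of three bytes: AAAAAAAA AAAABBBB BBBBBBBB.
// For simplicity the whole three-byte group must lie inside the buffer.
std::uint16_t read_u12_packed(const unsigned char* buf, std::size_t len,
                              std::size_t index) {
    const std::size_t base = (index / 2) * 3;  // first byte of the group
    assert(base + 2 < len);
    if (index % 2 == 0)  // first value: all of byte 0, high nibble of byte 1
        return static_cast<std::uint16_t>((buf[base] << 4) | (buf[base + 1] >> 4));
    // second value: low nibble of byte 1, all of byte 2
    return static_cast<std::uint16_t>(((buf[base + 1] & 0x0F) << 8) | buf[base + 2]);
}
```

For the bytes 0xAB 0xCD 0xEF, read_u24_le returns 0xEFCDAB, while read_u12_packed returns 0xABC for index 0 and 0xDEF for index 1.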

score 0 · Answer 5 · answered Nov 09 '11 at 21:47

Below is an example on how to get a range of bits from a variable.

The example simulates that some binary data has been read from a file and stored in a vector<unsigned char>.

The data is copied (by the extract function template) from the vector into a variable. After that, the get_bits function returns the requested bits in a new variable. Simple as that!

#include <cmath>    // pow
#include <cstring>  // memcpy
#include <vector>

using namespace std;

template<typename T>
T extract(const vector<unsigned char> &v, int pos)
{
  T value;
  memcpy(&value, &v[pos], sizeof(T));
  return value;
}

template<typename IN, typename OUT>
OUT get_bits(IN value, int first_bit, int last_bit)
{
  value = (value >> first_bit);
  double the_mask = pow(2.0,(1 + last_bit - first_bit)) - 1;
  OUT result = value & static_cast<IN>(the_mask);
  return result;
}

int main()
{
  vector<unsigned char> v;
  //Simulate that we have read a binary file.
  //Add some binary data to v.
  v.push_back(255);
  v.push_back(1);
  //0x01 0xff
  short a = extract<short>(v,0);

  //Now get the bits from the extracted variable.
  char b = get_bits<short,char>(a,8,8);
  short c = get_bits<short,short>(a,2,5);
  int d = get_bits<short,int>(a,0,7);

  return 0;
}

This is just a simple example without any error checking.

You can use the extract function template to get data starting at any position in the vector. This vector only has 2 elements and the size of a short is 2 bytes so that's why the pos argument is 0 for the extract function.
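One caveat with building the mask through pow: a double carries only 53 bits of precision, so the mask cannot be exact for very wide ranges, and right-shifting a negative signed value is implementation-defined before C++20. A possible integer-only variant is sketched below; get_bits_u64 is a hypothetical name, and the bit numbering (bit 0 is the least significant, last_bit inclusive) follows the example above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical variant of get_bits: returns bits first_bit..last_bit
// (inclusive, bit 0 = least significant) of an unsigned 64-bit value.
inline std::uint64_t get_bits_u64(std::uint64_t value,
                                  unsigned first_bit, unsigned last_bit) {
    assert(first_bit <= last_bit && last_bit < 64);
    const unsigned width = last_bit - first_bit + 1;
    // Shifting a 64-bit value by 64 is undefined, so handle the full width separately.
    const std::uint64_t mask =
        (width == 64) ? ~std::uint64_t(0) : ((std::uint64_t(1) << width) - 1);
    return (value >> first_bit) & mask;
}
```

For the value 0x01ff, asking for bits 8..8 gives 1, bits 2..5 give 0xF, and bits 0..7 give 0xFF.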

Good luck!

This is a neat solution and is what I am pursuing. The idea is to store the binary data in a vector of char, take data from it to process m bitN integers, and store the results in a vector of type T, with a class doing the bookkeeping needed for batch processing. This code, however, does not handle batch processing, nor the leftover unprocessed bits for the next read. Just my two cents. Thank you — shangping, Nov 09 '11 at 22:29
Well the code did not consider batch processing because it was not in the question, was it? But I think you could add this easily. The functions allows you to extract an "arbitrary" length of data from the vector. Then the get_bits function lets you decide what start and stop position of your current "bit-region of interest" is. Then just set the region to be the next region that you are interested in and get the bits. Hope it helps! — mantler, Nov 10 '11 at 08:19
