5

I'm trying to implement the Huffman's encoding algorithm in c++.

my question is : after i got the equivalent binary string for each character , how can i write those zeros and ones as binary on a file not as string 0 or string 1 ?

thanks in advance ...

HSN
  • 783
  • 3
  • 11
  • 20
  • Just to make things clear . what i mean by binary equivalent string is this : for example if A was encoded by 010 i want to write 010 on the file as binary 0 and binary one so the total is 3 bits NOT 24 bit (3 byte) each one with ASCII equivalent binary for the character 0 and character 1 . – HSN Jun 24 '12 at 11:10
  • what we don't understand is what you have, not what you will need at the end. – akappa Jun 24 '12 at 12:46
  • i have a data structure(user defined) that contains 3 fields ; character ,frequency and equivalent bit encoding . now , first I'm going to read the text from a text file and fill the frequency field for each character , convert the data structure to a binary tree then traverse the tree to find the equivalent bit encoding for each character. Finally (which is my question) : i want to produce a compressed version of the original text file by reading each character from the original text and write its equivalent bit string (using the binary tree) on a binary file. – HSN Jun 24 '12 at 14:06
  • then you could use the `bitstream` class I included below to write the bits of your encodings using `push_bit(bool)`, then write it on file by writing the `(size() + 7) / 8` bytes of the array returned by `get_array()`. – akappa Jun 24 '12 at 14:10

3 Answers3

1

Obtaining individually the encoding of each character in a different data structure is a broken solution, because you need to juxtapose the encoding of each character in the resulting binary file: storing them individually makes that as hard as directly storing them contiguously in a vector of bits.

This consideration suggests using a std::vector<bool> to perform your task, but it is a broken solution because it can't be treated as a c-style array, and you really need that at output time.

This question asks precisely which are the valid alternatives to std::vector<bool>, so I think answers to that question fits perfectly your question.

BTW, what I would do is to just wrap a std::vector<uint8_t> under a class which suits yout needs, like the code attached:

#include <iostream>
#include <vector>
#include <cstdint>
#include <algorithm>
class bitstream {
private:
    std::vector<std::uint8_t> storage;
    unsigned int bits_used:3;
    void alloc_space();
public:
    bitstream() : bits_used(0) { }

    void push_bit(bool bit);

    template <typename T>
    void push(T t);

    std::uint8_t *get_array();

    size_t size() const;

    // beware: no reference!
    bool operator[](size_t pos) const;
};

void bitstream::alloc_space()
{
    if (bits_used == 0) {
        std::uint8_t push = 0;
        storage.push_back(push);
    }
}

void bitstream::push_bit(bool bit)
{
    alloc_space();
    storage.back() |= bit << 7 - bits_used++;
}

template <typename T>
void bitstream::push(T t)
{
    std::uint8_t *t_byte = reinterpret_cast<std::uint8_t*>(&t);
    for (size_t i = 0; i < sizeof(t); i++) {
        uint8_t byte = t_byte[i];
        if (bits_used > 0) {
            storage.back() |= byte >> bits_used;
            std::uint8_t to_push = (byte & ((1 << (8 - bits_used)) - 1)) << bits_used;
            storage.push_back(to_push);
        } else {
            storage.push_back(byte);
        }
    }
}

std::uint8_t *bitstream::get_array()
{
    return &storage.front();
}

size_t bitstream::size() const
{
    const unsigned int m = 0;
    return std::max(m, (storage.size() - 1) * 8 + bits_used);
}

bool bitstream::operator[](size_t size) const
{
    // No range checking
    return static_cast<bool>((storage[size / 8] >> 7 - (size % 8)) & 0x1);
}

int main(int argc, char **argv)
{
    bitstream bs;
    bs.push_bit(true);
    std::cout << bs[0] << std::endl;
    bs.push_bit(false);
    std::cout << bs[0] << "," << bs[1] << std::endl;
    bs.push_bit(true);
    bs.push_bit(true);
    std::uint8_t to_push = 0xF0;
    bs.push_byte(to_push);
    for (size_t i = 0; i < bs.size(); i++)
        std::cout << bs[i] << ",";
    std::cout << std::endl;
}
Community
  • 1
  • 1
akappa
  • 10,220
  • 3
  • 39
  • 56
1

I hope this code can help you.

  • You start from a sequence of bytes (1s and 0s) representing the continuous encoding of every character of the input file.
  • You take every byte of the sequence and add a bit into a temporary byte (char byte)
  • Every time you fill a byte, you write it to file (you could also wait, for efficiency, to have a bigger data)
  • At the end, you write the remaining bits to file, filled with trailing zeros, for example
  • As akappa correctly pointed out, the else branch can be removed if byte is set to 0 after each file writing operation (or, more generically, every time it has been totally filled and flushed somewhere else), so only 1s must be written.

void writeBinary(char *huffmanEncoding, int sequenceLength)
{
    char byte = 0;
    // For each bit of the sequence
    for (int i = 0; i  < sequenceLength; i++) {
        char bit = huffmanEncoding[i];

        // Add a single bit to byte
        if (bit == 1) {
            // MSB of the sequence to msb of the file
            byte |= (1 << (7 - (i % 8)));
            // equivalent form: byte |= (1 << (-(i + 1) % 8);
        }
        else {
            // MSB of the sequence to msb of the file
            byte &= ~(1 << (7 - (i % 8)));
            // equivalent form: byte &= ~(1 << (-(i + 1) % 8);
        }

        if ((i % 8) == 0 && i > 0) {
            //writeByteToFile(byte);
        }
    }

    // Fill the last incomplete byte, if any, and write to file
}
Vincenzo Pii
  • 18,961
  • 8
  • 39
  • 49
  • So you are assuming that each encoding is stored with a byte representing each bit? Sounds like a spectacular waste of memory. btw, you are outputting the bits in the wrong order (it should be `(-i) % 8`) and you could erase the `else` branch by clearing `byte` at the beginning at after each writing. – akappa Jun 24 '12 at 10:49
  • @akappa That is what the OP has as input: "i got the equivalent binary string for each character". Also, I do not agree with the bit ordering problem, I am outputting them as they are in the encoding string. First byte of the encoding string, will be first bit in file. Good point for clearing the else. – Vincenzo Pii Jun 24 '12 at 10:51
  • the first bit in file will be the most significant bit of your byte, while you first one will be the least significant one of the "virtual array of bytes" you get by consolidating all that bits in a proper array of bytes. – akappa Jun 24 '12 at 10:54
  • PS: I interpreted his "binary string" differently than you did (an array of bytes for each character), but I agree that his wording is ambiguous. – akappa Jun 24 '12 at 10:57
  • @akappa, yes. However, I changed the ordering of bits, as you were totally right about it. – Vincenzo Pii Jun 24 '12 at 10:58
  • Why is the last if statement not ( (i+1) % 8 ==0)? Since we are executing it at the end of the loop, won't it execute for the first time after the 9th bit has been read in? – ajspencer Jun 01 '16 at 13:41
0

You cant write to a binary file with only bits; the smallest size of data written is one byte (thus 8 bits).

So what you should do is create a buffer (any size).

char BitBuffer;

Writing to a buffer:

int Location;
bool Value;

if (Value)
    BitBuffer |= (1 << Location);
else
    BitBuffer &= ~(1 << Location)

The code (1 << Location) generates a number with all 0's except the position specified by Location. Then, if Value is set to true, it sets corresponding bit in Buffer to 1, and to 0 in other case. The binary operations used are fairly simple, if you don't understand them, it should be in any good C++ book/tutorial.

Location should be number in range <0, sizeof(Buffer)-1>, so <0,7> in this case.

Writing buffer to a file is relatively simple when using fstream. Just remember to open it as binary.

ofstream File;
File.open("file.txt", ios::out | ios::binary);
File.write(BitBuffer, sizeof(char))

EDIT: Noticed a bug and fixed it.

EDIT2: You can't use << operators in binary mode, i forgot about it.

Alternative solution : Use std::vector<bool> or std::bitset as a buffer.

This should be even simpler, but I thought I could help you a little bit more.

void WriteData (std::vector<bool> const& data, std::ofstream& str)
{
    char Buffer;
    for (unsigned int i = 0; i < data.size(); ++i)
    {
       if (i % 8 == 0 && i != 0)
           str.write(Buffer, 1);
       else
           // Paste buffer setting code here
           // Location = i/8;
           // Value = data[i];
    }
    // It might happen that data.size() % 8 != 0. You should fill the buffer
    // with trailing zeros and write it individually.
}
Bartek Banachewicz
  • 38,596
  • 7
  • 91
  • 135
  • thanks sir , but i have few questions if you don't mind . first what dose this line of code do ?? int Location; BitBuffer &= BitBuffer && (1 << Location); and the other question is : wouldn't this write the binary equivalent ASCII string to the zero or one not binary 0 and binary one ? – HSN Jun 24 '12 at 09:58
  • simplistic answer: (1) a buffer of just a char is useless (2) std::vector is totally broken, and (3) std::bitset isn't expendable. – akappa Jun 24 '12 at 10:32
  • @akappa i was under impression that `std::vector` works when indexed with it's own `[]` operator, so the above example should work, even if it's far from optimal. Also, i explicitly stated that the buffer can be larger than just `char` - so the OP could end up in C-style array anyway. The downvotes seem pretty unfair to me. – Bartek Banachewicz Jun 24 '12 at 10:37
  • You said it: it's far from optimal. Unless you want a toy implementation, you should write your encodings **directly** in something which is readily outputtable. There are solutions which does that (I wrote one for a compressor), so I don't think this is a good answer. – akappa Jun 24 '12 at 10:52