-1

I am trying to implement Huffman encoding in C. I am done with the tree construction and obtained the codeword for each symbol as the algorithm proceeds. But now I am stuck with insertion of the codewords into binary files for the corresponding symbol. Can someone suggest how the codeword or binary bits can be written into binary file so that i can obtain the compressed file.

The codewords are of variable length.

A function to write and read these bits to/from the file would be helpful.

This is the code I have written

void create_compressed_file()
{
    char str[20], ch, *str2, str1[10], str_arr[6], str3[10];
    FILE *fp, *fp2, *fp3;
    int i, array[20], j = 0;
    fp2 = fopen("newfile.txt", "r"); // contains the original text file
    fp3 = fopen("codeword.txt", "r"); // contains the symbol and codeword
    while (fscanf(fp2, "%s", &str) == 1) {
        rewind(fp3);
        str2 = strtok(str, "-");
        while (str2 != NULL) {
            strcpy(str_arr, str2);
            printf("str2= %s ", str_arr); //str2 stores the symbol(not char but a string)
            printf(" %s-", str2);
            while (fscanf(fp3, "%s", &str1) == 1) {
                if (strcmp(str1, str_arr) == 0) {
                    fscanf(fp3, "%s", &str1); // extracted corresponding codeword(1s and 0s) of   the symbol and stored it into str1
                    printf("%s\n", str1);
                    write_codeword_to_binaryfile(); // function that i want to create with is   incomplete and need your help.
                }
            }
            str2 = strtok(NULL, "-");
            rewind(fp3);
        }
        printf("\nspace:");
        strcpy(str_arr, "space");
        while (fscanf(fp3, "%s", &str1) == 1) {
            if (strcmp(str1, str_arr) == 0) {
                fscanf(fp3, "\n%s", &str1); // extract the codeword for(space)character  
                printf("%s\n", str1);
            }
        }
    }
    fclose(fp2);
    fclose(fp3);    
}

codeword.txt:

is  0000
por 00010
Plain   000110
most    0001110
the 0001111
ted 00100 
text    00101
ly  0011000
near    0011001
pli 0011010
ap  0011011
ble 0011100
ta  0011101
by  0011110
sup 0011111
cryp    0100000
In  0100001
ra  0100010
tog 0100011
ting    0100100
tain    0100101
mands   0100110
com 0100111
mes 0101000
to  0101001
ge  0101010
sa  0101011
plain   0101100
phy 0101101

I tried the above code as below but it didnt write anything... The file size after execution was 0 bytes:

#include <stdio.h>
#include <conio.h>
#include <stdint.h>

void write_codeword_to_binaryfile(
    const char *codeword, // codeword to write, in ASCII format
    FILE *file,           // destination file
    uint8_t *buffer,
    int *fullness)
{
    char c;
    //  fullness = ;
    *buffer = 0;
    for (c = *codeword++; c != '\0'; c = *codeword++) // iterate
    {
        int bit = c - '0'; // convert from ASCII to binary 0/1
        *buffer |= bit << (7 - fullness);
        ++fullness;
    }
    fputc(*buffer, file);
}

int main() {
    FILE *fp;
    uint8_t *buffer = 0;
    char *c = "10101010";
    char b = 0;
    int i;
    fp = fopen("myfile.bin", "wb");
    write_codeword_to_binaryfile(c, fp, buffer, 8);
    fclose(fp);
    getch();
}
chqrlie
  • 131,814
  • 10
  • 121
  • 189
Anib
  • 31
  • 1
  • 3
  • 1
    You need to write a function that groups eight bits into a byte and writes that into the file. – fuz Feb 08 '16 at 16:01
  • 2
    Welcome to Stack Overflow! Please show your research effort till time. Please read [Ask] page first. – Sourav Ghosh Feb 08 '16 at 16:04
  • Check `fread` and `fwrite`. Also it's a common practice to write a new file (the content that you want to leave can be copied in the new one) and add your things in there but leave the original untouched. The reason for this is that if the encoding doesn't work somehow (bad implementation or whatever) you would have lost the original state of the file hence you will be unable to clean up the mess you have created. If you don't need the original file you can always delete it after the encoding was completed and verified as working. – rbaleksandar Feb 08 '16 at 16:04
  • Create a `bitwrite` function that accepts 1 bit and shifts it into an unsigned static buffer byte. When the byte is "full" write it to file and reset the byte buffer and the bit counter. Pass each bit of every Huffman code to the `bitwrite` function. Make sure you are not left with any "dangling" bits at the end that also need to be (shifted?) and written to file. – Weather Vane Feb 08 '16 at 16:09
  • 1
    Show some of your code. What does a codeword look like? – n. m. could be an AI Feb 08 '16 at 16:15
  • I see that you have added your code. This is a good start; maybe someone will already be willing to answer your question. To improve the chances, please show what a codeword looks like. For example, show a few lines from your `codeword.txt` file. – anatolyg Feb 10 '16 at 09:43

1 Answers1

2

First of all, you should open your file in binary mode:

fp = fopen("myfile", "wb"); // "b" means "binary"

This is a must in Windows, but not necessary on most other platforms (you don't need to do anything special to differentiate the platform; just use "wb").

To write bits into the file, you should use a buffer - a partially-filled byte. Write the buffer to the file when it fills up (contains exactly 8 filled bits).

uint8_t buffer = 0;

You should use a counter that tracks how many bits are filled.

int fullness = 0;

Your function, which writes to a file, should receive the buffer and its fullness. Since it will change them, you actually have to send pointers:

void write_codeword_to_binaryfile(
    const char *codeword, // codeword to write, in ASCII format
    FILE *file,           // destination file
    uint8_t *buffer,
    int *fullness)
{
    for (char c = *codeword++; c != '\0'; c = *codeword++) // iterate
    {
        int bit = c - '0'; // convert from ASCII to binary 0/1
        ...
    }
}

There are two ways to arrange bits in a byte - little-endian (first bit is the least-significant bit) or big-endian (first bit is the most-significant bit). The customary way is to use big-endian ordering.

So if your buffer has a certain number of bits filled, how to fill the next bit? The following example shows a buffer with 5 bits filled:

011011...
      ^
next bit to fill (its position, starting from the left, is 2)

As you can see from this example, the position of the next bit is 7 - fullness. So, for each bit, do the following:

*buffer |= bit << (7 - *fullness);
++fullness;

See How do you set, clear and toggle a single bit in C/C++? for more info.

When the buffer is full (fullness is equal to 8), write it to file:

fputc(*buffer, file);
*fullness = 0;
*buffer = 0;

You should also "flush" the buffer (i.e. write it to file) when finished encoding your message:

if (*fullness > 0)
    fputc(*buffer, file);

By the way, what happens at the end of the message is a common non-trivial problem for bit-level encoders. You should think about it from the point of view of the decoder: you need to understand how many bits you should decode in the last byte of the file. There are several solutions for this:

  • After encoding your message, encode an additional 1 bit, and then zero bits until the buffer is full. The decoder will need to decode the zero bits and the 1 bit in reverse. This is used by MPEG.
  • Write the length of the message, in bits, in the file's header. This is probably the simplest solution, although it requires updating the file's beginning after finishing the encoding.
  • Have a special codeword for "end of message" (also often used)
anatolyg
  • 26,506
  • 9
  • 60
  • 134
  • how the function call for write_codeword_to_binaryfile() would look like...what value should I pass for int *fullness and for uint8_t *buffer ?? – Anib Feb 12 '16 at 10:25
  • ok suppose my codeword contains "10101010". Now when i use these binary operation to make up a byte....it stores the decimal value of the codeword i.e 170. still fine though....but when i read back the same byte from the file,it returns me 0. i have tried fread() and fgetc() method.... when i print the returned value on the console ii shows 0. why is it so.. ? please help – Anib Feb 15 '16 at 14:09
  • @anatolyg `for (char c = *codeword++; c != '\0'; c = codeword++)`. `char c = *codeword++; ` skips a character. Logical error likely. `for (char c = *codeword; c != '\0'; c = *(codeword++))` might be better. – Kitswas Jun 08 '22 at 09:19
  • The code looks OK to me, no error that I see. This code `for (char c = *codeword; c != '\0'; c = *(codeword++))` looks like it uses the first item twice. – anatolyg Jun 08 '22 at 13:50