My Systems Programming project has us implementing a compression/decompression program that shrinks ASCII text files by removing the always-zero top bit of each byte and writing the result to a separate file; the decompression routine reverses the process. The professor requires us to work with binary files through the Unix system calls: open, close, read, write, etc.

From my understanding, read and write transfer binary data in chunks of a specified number of bytes. However, since this data is binary, I'm not sure how to parse it.

This is a stripped down version of my code, minus the error checking:

void compress(char readFile[]){

  char buffer[BUFFER]; //buffer size set to 4096, but tunable to system preference
  int openReadFile;
  openReadFile = open(readFile, O_RDONLY);
}

If I use read to read the data into buffer, will the data in buffer be in binary or character format? Nothing I've come across addresses that detail, and it's very relevant to how I parse the contents.

Jason
  • There isn't a separate type for `byte` in C. Consider each char you read as a byte. – xanatos Feb 21 '11 at 20:48
  • A `char` is a byte. What do you mean by "binary or character format"? – Matti Virkkunen Feb 21 '11 at 20:24
  • He is probably "mixing" the "text mode" and "binary mode" of `fopen`. `open`, being a primitive, doesn't have a "text mode" (if I remember correctly), so it's always in "binary mode" – xanatos Feb 21 '11 at 20:26
  • @xanatos: ...which doesn't make any difference on any *nix I've used – Matti Virkkunen Feb 21 '11 at 20:26
  • @matti I see on a random manual of fopen: "In many environments, such as most UNIX-based systems, it makes no difference to open a file as a text file or a binary file; Both are treated exactly the same way, but differentiation is recommended for a better portability.". Still, many is not all, and his was an intelligent question (if this one was the question) – xanatos Feb 21 '11 at 20:29
  • @xanatos: I have a bad feeling there's a bigger misunderstanding going on here than just text/binary mode for fopen. – Matti Virkkunen Feb 21 '11 at 20:30
  • He might also be thinking of Java or other languages with Unicode support. In those, you must explicitly convert between text characters and binary data because it makes a big difference. – Zan Lynx Feb 21 '11 at 20:35
  • @ZanLynx: There's a big difference between text and bytes in all languages. C just doesn't make it as obvious. – Matti Virkkunen Feb 21 '11 at 20:38
  • @ZanLynx He is probably thinking of Java, because Java seems to be his main platform judging by his questions (and one is VERY explicit: Recommendations for a C beginner book for Java coder) – xanatos Feb 21 '11 at 20:46
  • xanatos was right on this. First, my only experience with direct file I/O in C is using `fopen`, not `open`. Second, no examples I found with `open` dealt with the data format. Third, I know Java pretty well, and jumping to C is eerily similar to my first programming class. – Jason Feb 21 '11 at 21:29

3 Answers

read() will read the bytes in without any interpretation (so "binary" mode).

Since the data is binary and you want to access the individual bytes, you should use a buffer of unsigned char: unsigned char buffer[BUFFER]. You can treat char/unsigned char as bytes; they are 8 bits on Linux.
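As a rough illustration of reading raw bytes into an unsigned char buffer, here is a minimal sketch. The helper name count_non_ascii is made up for this example, and it takes an already opened file descriptor (for instance the one returned by open in the question). It simply counts bytes whose top bit is set, which is a useful sanity check before stripping that bit.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

#define BUFFER 4096  /* same tunable chunk size as in the question */

/* Hypothetical helper (not from the question): returns how many bytes
 * readable from fd have the top bit set, i.e. are not 7-bit ASCII,
 * or -1 if read() reports an error. */
long count_non_ascii(int fd)
{
    unsigned char buffer[BUFFER];
    long count = 0;
    ssize_t n;

    assert(fd >= 0);
    /* read() returns the number of bytes stored in buffer, 0 at end of file */
    while ((n = read(fd, buffer, sizeof buffer)) > 0) {
        for (ssize_t i = 0; i < n; i++) {
            if (buffer[i] & 0x80)   /* unsigned char avoids sign surprises */
                count++;
        }
    }
    return n < 0 ? -1 : count;
}
```

If this returns anything other than 0 for an input file, the file is not pure 7-bit ASCII and dropping the top bit would lose information.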

Now, since what you're dealing with is 8-bit ASCII compressed down to 7 bits per character, you'll have to expand each 7-bit group back into 8 bits so you can make sense of the data.

To explain what's being done, consider the text Hey. That's 3 bytes of 8 bits each, and in ASCII the bit patterns are:

01001000 01100101 01111001

Now remove the most significant bit of each byte and shift the remaining bits left to fill the gaps, so that each byte's 7 bits follow directly after those of the previous byte.

X1001000 X1100101 X1111001

Above, X is the bit to be removed. Removing those and shifting the others, you end up with bytes with this pattern:

10010001 10010111 11001000

The rightmost 3 bits are just filled in with 0. So far no space is saved, though: there are still 3 bytes. With a string of 8 bytes we would have saved 1 byte, since 8 characters of 7 bits each are 56 bits, which fit in 7 bytes.
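The packing described above could be sketched roughly as follows. The function name pack7 and the bit-accumulator approach are my own choices for illustration, not something the assignment prescribes. It assumes the 7-bit groups are written most significant bit first and the last byte is padded with zero bits, as in the Hey example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: packs n 7-bit ASCII bytes from in[] into out[],
 * most significant bit first, and returns the number of bytes written.
 * out[] must have room for (n * 7 + 7) / 8 bytes. */
size_t pack7(const unsigned char *in, size_t n, unsigned char *out)
{
    unsigned int acc = 0;   /* bit accumulator */
    int bits = 0;           /* number of valid bits currently in acc */
    size_t written = 0;

    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        acc = (acc << 7) | (in[i] & 0x7F);  /* drop the top bit, append 7 bits */
        bits += 7;
        while (bits >= 8) {                 /* emit every complete byte */
            bits -= 8;
            out[written++] = (unsigned char)(acc >> bits);
        }
        acc &= (1u << bits) - 1;            /* keep only the leftover bits */
    }
    if (bits > 0)                           /* pad the last byte with zeros */
        out[written++] = (unsigned char)(acc << (8 - bits));
    return written;
}
```

Packing the three bytes of Hey this way gives 0x91 0x97 0xC8, which are the three bit patterns shown above.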

Now you have to do the reverse on the bytes you've read back in: split the bit stream into 7-bit groups and put a zero top bit back on each one.
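A possible sketch of that reverse step is below. The name unpack7 is hypothetical, and it assumes the 7-bit groups were packed most significant bit first as in the Hey example. It also assumes the caller knows how many characters to recover: 7 packed bytes could hold either 7 or 8 characters, so the count has to come from somewhere (a header in the file, for example, which is my assumption and not part of the question).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: rebuilds n_chars bytes (top bit zero) from 7-bit
 * groups packed most significant bit first in in[].
 * in[] must hold at least (n_chars * 7 + 7) / 8 bytes. */
void unpack7(const unsigned char *in, size_t n_chars, unsigned char *out)
{
    unsigned int acc = 0;   /* bit accumulator */
    int bits = 0;           /* number of valid bits currently in acc */
    size_t next = 0;        /* index of the next packed byte to consume */

    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n_chars; i++) {
        if (bits < 7) {                     /* refill with one packed byte */
            acc = (acc << 8) | in[next++];
            bits += 8;
        }
        bits -= 7;
        out[i] = (unsigned char)((acc >> bits) & 0x7F);  /* top bit is 0 again */
        acc &= (1u << bits) - 1;            /* keep only the leftover bits */
    }
}
```

Feeding it the bytes 0x91 0x97 0xC8 with n_chars set to 3 yields H, e, y.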

nos

I'll quote the manual of the fopen function (which is built on top of the open system call) from http://www.kernel.org/doc/man-pages/online/pages/man3/fopen.3.html:

The mode string can also include the letter 'b' either as a last character or as a character between the characters in any of the two-character strings described above. This is strictly for compatibility with C89 and has no effect; the 'b' is ignored on all POSIX conforming systems, including Linux

So even the high level function ignores the mode :-)

xanatos

It will read the raw contents of the file into the memory that buffer points to. On Linux a byte is 8 bits and so is a char, so if the file is a regular plain-text document you'll end up with printable characters in the buffer. Be careful with how the data ends: read does not add a terminating '\0'; it returns the number of bytes read (which is the number of characters in an ASCII-encoded plain-text file).
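To make the point about the end of the data concrete, here is a small sketch. The helper name read_as_string is invented for this example. It does a single read and adds the terminator itself, using the count that read returns.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: performs one read() of at most cap - 1 bytes into
 * buf and NUL-terminates the result so it can be used as a C string.
 * Returns the number of bytes read, or -1 on error. */
ssize_t read_as_string(int fd, char *buf, size_t cap)
{
    assert(buf != NULL && cap > 0);
    ssize_t n = read(fd, buf, cap - 1);  /* leave room for the terminator */
    if (n < 0)
        return -1;
    buf[n] = '\0';                       /* read() never adds this itself */
    return n;
}
```

Note that a single read may return fewer bytes than requested, so a real program would normally loop until read returns 0.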

Edit: in case the file you're reading isn't a text file but a sequence of binary records, you can give the buffer the type of those records, even if it's a struct.
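A minimal sketch of that idea, with a made-up struct record and helper read_record purely for illustration. This only works reliably when the file was written by a program using the same struct layout on the same kind of machine, since padding and byte order are not portable.

```c
#include <assert.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical record layout, invented for this example */
struct record {
    uint32_t id;
    uint16_t flags;
    uint16_t length;
};

/* Hypothetical helper: reads one record straight into the struct.
 * Returns 1 on success, 0 on end of file or a short read, -1 on error. */
int read_record(int fd, struct record *rec)
{
    assert(rec != NULL);
    ssize_t n = read(fd, rec, sizeof *rec);  /* raw bytes land in the struct */
    if (n < 0)
        return -1;
    return n == (ssize_t)sizeof *rec;
}
```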

Phrodo_00