6

Consider the following piece of code for reading the contents of the file into a buffer

#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#define BLOCK_SIZE 4096

int main()
{
   int fd=-1;
   ssize_t bytes_read=-1;
   int i=0;
   char buff[50];
   //Arbitary size for the buffer?? How to optimise.
   //Dynamic allocation is a choice but what is the
   //right way to relate the file size to bufffer size.

   fd=open("./file-to-buff.txt",O_RDONLY);
   if(-1 == fd)
   {
      perror("Open Failed");
      return 1;
   }

   while((bytes_read=read(fd,buff,BLOCK_SIZE))>0)
   {
      printf("bytes_read=%d\n",bytes_read);
   }

   //Test to characters read from the file to buffer.The file contains "Hello"
   while(buff[i]!='\0')
   {
      printf("buff[%d]=%d\n",i,buff[i]);
      i++;
      //buff[5]=\n-How?
   }
   //buff[6]=`\0`-How?
   close(fd);
   return 0;
}

Code Description:

  • The input file contains a string "Hello"
  • This content needs to be copied into the buffer.
  • The objective is acheived by open and read POSIX API's.
  • The read API uses a pointer to a buffer of an*arbitary size* to copy the data in.

Questions:

  • Dynamic allocation is the method that must be used to optimize the size of the buffer.What is the right procedure to relate/derive the buffer size from the input file size?
  • I see at the end of the read operation the read has copied a new line character and a NULL character in addition to the characters "Hello". Please elaborate more on this behavior of read.

Sample Output

bytes_read=6

buff[0]=H

buff[1]=e

buff[2]=l

buff[3]=l

buff[4]=o

buff[5]=

PS: Input file is user created file not created by a program (using write API). Just to mention here, in case if it makes any difference.

Vivek Maran
  • 2,623
  • 5
  • 38
  • 52
  • There is no `NULL` character. You propably meant the character `NUL` (aka `'\0'`) which is the same as the integer `0`. – alk Nov 10 '12 at 13:11
  • Read regarding 1 "line feed" (aka `LF`, `'\n'`), "carrige return" (aka `CR`, `'\r'`) and 2 "0-termination" of strings in C. – alk Nov 10 '12 at 13:14
  • Yes, I meant `buff[6]` has `\0` – Vivek Maran Nov 10 '12 at 13:14
  • At first,I use `echo Hello > file-to-buff.txt` to generate the test file,and found a '\n' was added to the end.Then I use vi to create a new file with context "Hello",and also a '\n' too.It's so strange! – prehistoricpenguin Nov 10 '12 at 13:58
  • @chm: Yes. Simple test is generate a text file with just "Hi" in it. size of that file would be three bytes.so `\n` is added somehow to the file. – Vivek Maran Nov 10 '12 at 14:04
  • 1
    @Vivek27:I found that vim will [automatically](http://stackoverflow.com/questions/1050640/vim-disable-automatic-newline-at-end-of-file) add '\n' to the end of file.And if we use `echo -n Hello > file-to-buff.txt`,there won't be a '\n' in the end of the file. – prehistoricpenguin Nov 10 '12 at 15:04
  • `buff[6]` contains `'\0'` by accident; the local variable `buff` was not guaranteed to be zeroed. You should only process the number of bytes read; you should not look for specific values in the area of the buffer that was not filled by `read()`. When you were told that 6 bytes were read, that was all the data in `buff` that was reliably initialized; the rest is quasi-random leftover garbage (or the results of a previous, longer `read()` into the same buffer, or ... garbage). – Jonathan Leffler Nov 10 '12 at 15:58

4 Answers4

7

Since you want to read the whole file, the best way is to make the buffer as big as the file size. There's no point in resizing the buffer as you go. That just hurts performance without good reason.

You can get the file size in several ways. The quick-and-dirty way is to lseek() to the end of the file:

// Get size.
off_t size = lseek(fd, 0, SEEK_END); // You should check for an error return in real code
// Seek back to the beginning.
lseek(fd, 0, SEEK_SET);
// Allocate enough to hold the whole contents plus a '\0' char.
char *buff = malloc(size + 1);

The other way is to get the information using fstat():

struct stat fileStat;
fstat(fd, &fileStat); // Don't forget to check for an error return in real code
// Allocate enough to hold the whole contents plus a '\0' char.
char *buff = malloc(fileStat.st_size + 1);

To get all the needed types and function prototypes, make sure you include the needed header:

#include <sys/stat.h> // For fstat()
#include <unistd.h>   // For lseek()

Note that read() does not automatically terminate the data with \0. You need to do that manually, which is why we allocate an extra character (size+1) for the buffer. The reason why there's already a \0 character there in your case is pure random chance.

Of course, since buf is now a dynamically allocated array, don't forget to free it again when you don't need it anymore:

free(buff);

Be aware though, that allocating a buffer that's as large as the file you want to read into it can be dangerous. Imagine if (by mistake or on purpose, doesn't matter) the file is several GB big. For cases like this, it's good to have a maximum allowable size in place. If you don't want any such limitations, however, then you should switch to another method of reading from files: mmap(). With mmap(), you can map parts of a file to memory. That way, it doesn't matter how big the file is, since you can work only on parts of it at a time, keeping memory usage under control.

Nikos C.
  • 50,738
  • 9
  • 71
  • 96
  • You seldom need `` these days. It used to be crucial, but in this millennium the main headers ensure `` is included if they need information from it. – Jonathan Leffler Nov 10 '12 at 15:46
  • @JonathanLeffler The manpage gives that info. I've become quite paranoid over this, as GCC version 4.4 (or was it 4.5?) started cleaning up their include chains and a vast amount of applications stopped compiling (I'm using a source-based Linux distro, so I felt that.) They assumed that one header is automatically included by another, which was no longer the case. After that, if the docs say what the proper includes are, I follow them to the letter :-) – Nikos C. Nov 10 '12 at 15:55
  • Look at the [POSIX 2008](http://pubs.opengroup.org/onlinepubs/9699919799/toc.htm) specification. Look up [`lseek`](http://pubs.opengroup.org/onlinepubs/9699919799/functions/lseek.html); it says `#include `. Look up [`fstat()`](http://pubs.opengroup.org/onlinepubs/9699919799/functions/fstat.html); it says `#include `. I'm not aware of a function entry in POSIX 2008 that requires ``; that isn't saying there isn't one, but it's not one that I've looked at. POSIX (1997) was different: [`lseek()`](http://pubs.opengroup.org/onlinepubs/007908799/xsh/lseek.html). – Jonathan Leffler Nov 10 '12 at 16:05
  • @JonathanLeffler I was looking at `man 2 fstat`, which is the Linux manpage. `man 3p fstat` is the POSIX manpage, and indeed it only lists ``. Not sure why the Linux manpage disagrees. – Nikos C. Nov 10 '12 at 16:13
  • That explains the discrepancy...I think it is close to a doc bug on the `man 2 fstat` page, and a 'harmless' one (nothing will fail to compile if you follow the `man 2 fstat` page), but mainly a hangover from the days of yore when it was necessary. It is good to be careful about which headers are necessary — many people are not and it can bite them — especially in C++. (One possibility is that if you set `#define _XOPEN_SOURCE 500`, you do need the explicit ``.) – Jonathan Leffler Nov 10 '12 at 16:18
  • 1
    Thanks Nikos!! It helps. Just a small suggestion, I beleive `lseek` will return of type `off_t` – Vivek Maran Nov 10 '12 at 16:22
  • @Vivek27 Indeed. And it's important since the error condition is specifically described as `(off_t)-1` and probably signed. – Nikos C. Nov 10 '12 at 16:26
  • Should you be running `memset` on the malloced buffer to make sure that last byte is a null byte? – CMCDragonkai Nov 08 '17 at 07:00
  • @CMCDragonkai No. (And if we wanted that, we'd use `calloc()` instead of `malloc()`.) But since we only need the last byte to be zero, we can just set it to zero. Writing zeroes over the whole memory is not efficient. Just do `buff[size] = '\0';`. Or, if you end up not actually reading the whole file, then `buff[byte_amount_you_actually_read] = '\0';`. In general, just append a `'\0'` to the end of the data. – Nikos C. Nov 08 '17 at 07:19
3

1, you can get the file size with stat(filename, &stat), but define the buffer to page size is just fine

2, first, there is no NULL character after "Hello", it must be accident that the stack area you allocated was 0 before your code executed, please refer to APUE chapter 7.6. In fact you must initialize the local variable before using it.

I tried to generate the text file with vim, emacs and echo -n Hello > file-to-buff.txt, only vim adds a line break automatically

jason.foo
  • 331
  • 1
  • 4
  • 14
  • That's because `vim` edits text files, primarily, and text files end with a newline. And the C standard says that the library could drop the characters after the last newline in the file if it is a text file (but not if it is a binary file). – Jonathan Leffler Nov 10 '12 at 16:30
2

You could consider allocating the buffer dynamically by first creating a buffer of a fixed size using malloc and doubling (with realloc) the size when you fill it up. This would have a good time complexity and space trade off.

At the moment you repeatedly read into the same buffer. You should increase the point in the buffer after each read otherwise you will overwrite the buffer contents with the next section of the file.

The code you supply allocates 50 bytes for the buffer yet you pass 4096 as the size to the read. This could result in a buffer overflow for any files over the size of 50 bytes.

As for the `\n' and '\0'. The newline is probably in the file and the '\0' was just already in the buffer. The buffer is allocated on the stack in your code and if that section of the stack had not been used yet it would probably contain zeros, placed there by the operating system when your program was loaded.

The operating system makes no attempt to terminate the data read from the file, it might be binary data or in a character set that it doesn't understand. Terminating the string, if needed, is up to you.

A few other points that are more a matter of style:

  • You could consider using a for (i = 0; buff[i]; ++i) loop instead of a while for the printing out at the end. This way if anyone messes with the index variable i you will be unaffected.
  • You could close the file earlier, after you finish reading from it, to avoid having the file open for an extended period of time (and maybe forgetting to close it if some kind of error happens).
Will
  • 4,585
  • 1
  • 26
  • 48
  • 1
    +1 for the point about `read(fd, buff, BLOCK_SIZE)` being bad. When the buffer is local (as here), use `read(fd, buff, sizeof(buff))`; when the buffer is provided as an argument (`function(..., char *buffer, size_t buflen, ...)`, use the `buflen` size that is specified. If the function interface doesn't explicitly specify the size of buffer, you have to find out what the implicit buffer size is, which can be tricky. It is best to write an explicit interface. – Jonathan Leffler Nov 10 '12 at 16:34
1

For your second question, read don't add automatically a character '\0'. If you consider that your file is a textual file, your must add a '\0' after calling read, for indicate the end of string.

In C, the end of string is represented by this caracter. If read set 4 characters, printf will read these 4 characters, and will test the 5th: if it's not '\0', it will continue to print until next '\0'. It's also a source of buffer overflow

For the '\n', it is probably in the input file.

BAK
  • 972
  • 8
  • 23