
Consider this code, which reads a text-based file. This sort of fread() usage is briefly touched upon in the excellent book C Programming: A Modern Approach by K.N. King. There are other methods of reading text-based files, but here I am concerned with fread() only.

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    // Declare file stream pointer.
    FILE *fp = fopen("Note.txt", "r");
    // fopen() call successful.
    if(fp != NULL)
    {
        // Navigate through to end of the file.
        fseek(fp, 0, SEEK_END);
        // Calculate the total bytes navigated.
        long filesize = ftell(fp);
        // Navigate to the beginning of the file so
        // it can be read.
        rewind(fp);
        // Declare array of char with appropriate size.
        char content[filesize + 1];
        // Set last char of array to contain NULL char.
        content[filesize] = '\0';
        // Read the file content.
        fread(content, filesize, 1, fp);
        // Close file stream pointer.
        fclose(fp);
        // Print file content.
        printf("%s\n", content);
    }
    // fopen() call unsuccessful.
    else
    {
        printf("File could not be read.\n");
    }
    return 0;
}

There are some problems I have with this method. My opinion is that this is not a safe method of performing fread() since there might be an overflow if we try to read an extremely large string. Is this opinion valid?

To circumvent this issue, we could pick a buffer size and keep reading into a char array of that size. If filesize is less than the buffer size, we simply perform fread() once, as in the code above. Otherwise, we divide the total file size by the buffer size and use the integer part of the result as the number of loop iterations, invoking fread() in each iteration and appending the buffer to a larger string (which we would have malloc-ed with filesize + 1 beforehand). After the loop, for the final fread(), we have to read exactly (filesize % buffersize) bytes into an array of that size and append it to the larger string. I find that if we perform the last fread() using buffersize as its second parameter, then extra garbage data of size (buffersize - chunksize) will be read in and the data might become corrupted. Are my assumptions here correct? Please explain if and how I have overlooked something.

Also, there is the issue that non-ASCII characters might not be 1 byte in size. In that case I would assume the proper amount is being read, but each byte is being read as a single char, so the text is distorted somehow? How does fread() handle reading multi-byte chars?

mindoverflow
  • 1
    Calling [`ftell`](https://en.cppreference.com/w/c/io/ftell) is only guaranteed to give you the number of bytes from the beginning of the file when the file is open in binary mode. However, you are opening the file in text mode. In that mode, the value returned by `ftell` is only meaningful to `fseek`. However, this will probably not be an issue, because the value you get for the file size will be too high rather than too low. – Andreas Wenzel Apr 23 '22 at 02:55
  • 2
    if there are multibyte characters, is it really a text file? Kind of, sure, but `fread` doesn't handle it, and you can't very well use `char` if you want to have multibyte characters. – Garr Godfrey Apr 23 '22 at 03:04
  • @GarrGodfrey I was under the impression that char is able to handle multi-byte chars like unicode. In that case, instead of being 1 byte, the char is another version with size of 2 bytes. In Windows, this would be ```wchar```. Please correct me if I am mistaken. – mindoverflow Apr 23 '22 at 03:11
  • @AndreasWenzel would you please expand on that? Why would the value returned by ftell only be meaningful to fseek if we read the file in text mode? ftell does give me the correct size of my text based file, as I have tested, albeit in a limited way. – mindoverflow Apr 23 '22 at 03:13
  • 3
    Failing to check the return from `fread()` and compare against `filesize` is the No. 1 way to get into trouble. On short read, don't forget to check `feof()` and `ferror()` to determine why the failure occurred. You can use `stat()` to get the number of bytes in the file regardless of content type. Then a binary read into `content` will ensure you read the entire file. The `filesize + 1` trick simply allows you to treat `contents` as a string. So long as your file contains single-byte characters, that will work. Note also, `char content[filesize + 1];` is a VLA. – David C. Rankin Apr 23 '22 at 03:19
  • 1
    @mindoverflow: On Microsoft Windows, for example, in text mode, the `\r\n` line endings are converted to `\n`, i.e. 2 bytes are shortened to 1 byte. This means that the number of bytes readable in text mode will not equal the actual file size reported by the operating system. The ISO C standard allows for such behavior, that is why it specifies that the value returned by `ftell` is unspecified and only meaningful to `fseek`, when the file is opened in text mode. – Andreas Wenzel Apr 23 '22 at 03:23
  • @DavidC.Rankin thank you for bringing this to my attention. odd that neither the textbook example nor online examples do this. what type of error might i be facing if i left it unchecked? on a side note, it's pretty interesting that there's a debate on how feof() does not return actual EOF, and that it is just a flag that is activated in certain circumstances. i'd like to hear your and other professional opinion about it sometime. – mindoverflow Apr 23 '22 at 03:25
  • 1
    @DavidC.Rankin: The function `stat` does not exist in ISO C. That function is a POSIX extension. – Andreas Wenzel Apr 23 '22 at 03:26
  • @AndreasWenzel that is quite interesting and also a bit concerning because then is there a way to get the actual file size using c? – mindoverflow Apr 23 '22 at 03:27
  • @AndreasWenzel in your opinion, what would be the ideal and reliable way to read a text based file in c? i don't mind POSIX as it is quite widespread already. feel free to answer as you've addressed all my queries. – mindoverflow Apr 23 '22 at 03:30
  • 2
    @AndreasWenzel - good catch. mindoverflow, see [man 3 fread](https://man7.org/linux/man-pages/man3/fread.3.html): on a short read, `fread()` doesn't distinguish which occurred, so you must call `feof()` and `ferror()` to determine which. There are a number of reasons why an error "can" occur: stream error, disk error, file corruption, etc. I don't have both sides of the `EOF` debate memorized, but from what I recall, `feof()` will test if the `EOF` indicator for the stream is set. The issue is in some cases errors other than end-of-file can set the indicator (though I don't recall what/when) – David C. Rankin Apr 23 '22 at 03:34
  • @DavidC.Rankin great input. i'll refer to the man page from now on. i usually check a couple of sources for c functions, but never thought to check linux man pages since im usually windows based. but i suppose its best to stick to unix based system standards for consistency in these matters. – mindoverflow Apr 23 '22 at 03:38
  • 1
    @mindoverflow: In POSIX, there is no difference between text mode and binary mode. In contrast to Microsoft Windows, POSIX uses `\n` line endings both in text mode and binary mode. Therefore, all problems and restrictions that exist when using text mode are nonexistent on POSIX platforms. For this reason, your way of determining the file size will work on POSIX platforms. All the problems I mentioned apply to ISO C in general, which is only important if you want your code to be portable to other platforms. – Andreas Wenzel Apr 23 '22 at 04:01
  • 1
    @mindoverflow: You may want to read this question: [How do you determine the size of a file in C?](https://stackoverflow.com/q/8236/12149471) Most of the answers are POSIX-specific, but one is Windows-specific too. Your way of determining the file size is also mentioned. – Andreas Wenzel Apr 23 '22 at 04:02
  • 1
    @mindoverflow: The linked question in my previous comment uses the words "file size" in the sense of the length of the file in binary mode (which is the size that the operating system normally reports). However, in ISO C, the only way to determine the file size in text mode is to read the entire file in text mode and count the number of bytes read. – Andreas Wenzel Apr 23 '22 at 04:14
  • I see. thank you for the information. The reason i am not coming across any issues is because i am using the POSIX version of mingw64 on my machine. i pre-emptively decided this because i knew windows somehow messes with established standards and tweaks their own. classic example ```\``` instead of ```/``` for paths – mindoverflow Apr 23 '22 at 04:22
  • 2
    Minor: `printf("File could not be read.\n");` is amiss in description. Better as `printf("File could not be opened.\n");` – chux - Reinstate Monica Apr 23 '22 at 07:54
  • 1
    @chux-ReinstateMonica excellent & important point. thank you. – mindoverflow Apr 30 '22 at 03:23

1 Answer


this is not a safe method of performing fread() since there might be an overflow if we try to read an extremely large string. Is this opinion valid?

fread() does not care about strings (null-character-terminated arrays). It reads data as if it were made up of multiples of unsigned char (*1), with no special regard for the data content if the stream is opened in binary mode, and perhaps with some data processing (e.g. end-of-line, byte-order-mark) in text mode.

Are my assumptions here correct?

Failed assumptions:

  • Assuming the ftell() return value equals the sum of the bytes read by fread(). The assumption can be false in text mode (which is how OP opened the file), and fseek() to the end is technically undefined behavior in binary mode.

  • Assuming not checking the return value of fread() is OK. Use the return value of fread() to learn whether an error or end-of-file occurred and how many elements were read.

  • Assuming error checking is not required. ftell(), fread(), and fseek() (used instead of rewind()) all deserve error checks. In particular, ftell() readily fails on streams that have no certain end.

  • Assuming no null characters are read. Reading an entire text file and appending a null character does not necessarily yield a single string. Robust code detects and/or copes with embedded null characters.

  • Multi-byte: assuming input meets the encoding requirements. Example: robust code detects (and rejects) invalid UTF-8 sequences, perhaps after reading the entire file.

  • Extreme: Assuming a file length <= LONG_MAX, the max value returned from ftell(). Files may be larger.
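
As a minimal sketch of the return-value check from the list above: `read_checked` is a hypothetical helper name, not something from OP's code or the standard library. It passes 1 as the element size so that the value returned by fread() is a byte count, and it uses ferror() to tell a read error apart from end-of-file after a short read.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: try to read `want` bytes from `fp` into `buf`.
 * Stores the number of bytes actually read in *got.
 * Returns 0 on a full read, 1 if end-of-file came first, -1 on error. */
int read_checked(FILE *fp, void *buf, size_t want, size_t *got)
{
    /* Element size 1, count `want`: the return value is a byte count. */
    *got = fread(buf, 1, want, fp);
    if (*got == want)
        return 0;
    if (ferror(fp))
        return -1;      /* short read caused by a read error */
    return 1;           /* short read caused by end-of-file */
}
```

Even on a short read, the first `*got` bytes of `buf` hold valid data; nothing beyond that is written, so a final partial chunk does not bring in "garbage".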

but each byte is being read as a single char, so the text is distorted somehow? How is fread() handling reading of multi-byte chars?

fread() does not operate on multi-byte boundaries, only on multiples of unsigned char. A given fread() may end partway through a multi-byte character, and the next fread() will continue from the middle of that character.
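
When reading in chunks, one way to cope with this is to find out how many bytes at the end of a chunk belong to an unfinished character and carry them over to the next chunk. The sketch below assumes the data is UTF-8 and otherwise valid; `utf8_incomplete_tail` is a hypothetical helper name.

```c
#include <stddef.h>

/* Hypothetical helper: returns how many bytes at the end of buf[0..len)
 * form an incomplete UTF-8 sequence, or 0 if the chunk ends on a
 * character boundary. Assumes the data is otherwise valid UTF-8. */
size_t utf8_incomplete_tail(const unsigned char *buf, size_t len)
{
    /* Step back over at most 3 continuation bytes (bit pattern 10xxxxxx). */
    size_t back = 0;
    while (back < len && back < 3 && (buf[len - 1 - back] & 0xC0) == 0x80)
        back++;
    if (back == len)
        return 0;                   /* no lead byte in view */

    unsigned char lead = buf[len - 1 - back];
    size_t need;                    /* sequence length announced by the lead byte */
    if (lead < 0x80)                need = 1;
    else if ((lead & 0xE0) == 0xC0) need = 2;
    else if ((lead & 0xF0) == 0xE0) need = 3;
    else if ((lead & 0xF8) == 0xF0) need = 4;
    else return 0;                  /* not a lead byte; leave it to validation */

    size_t have = back + 1;         /* bytes of that sequence present so far */
    return have < need ? have : 0;
}
```

A caller would process `len - tail` bytes, move the `tail` bytes to the front of the buffer, and let the next fread() fill in after them.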


Instead of a 2-pass approach, consider a single pass:

// Pseudo code
total_read = 0      
Allocate buffer, say 4096

forever
  if buffer full
    double buffer_size (`realloc()`)
  u = unused portion of buffer 
  fread u bytes into unused portion of buffer
  total_read += number_just_read
  if (number_just_read < u) 
    quit loop

Resize buffer total_read (+ 1 if appending a '\0')
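
The pseudo code above could be turned into C roughly as follows. This is a sketch, not a definitive implementation: the name `read_all`, the initial size of 4096 and the choice to always append a null character are assumptions made for illustration.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: read everything from fp into a malloc'ed buffer.
 * On success returns the buffer (with a null character appended) and
 * stores the byte count in *out_len. Returns NULL on failure.
 * The caller frees the result. */
char *read_all(FILE *fp, size_t *out_len)
{
    size_t cap = 4096, total = 0;
    char *buf = malloc(cap);
    if (buf == NULL)
        return NULL;

    for (;;) {
        if (total == cap) {                 /* buffer full: double it */
            if (cap > (size_t)-1 / 2) { free(buf); return NULL; }
            char *tmp = realloc(buf, cap * 2);
            if (tmp == NULL) { free(buf); return NULL; }
            buf = tmp;
            cap *= 2;
        }
        size_t unused = cap - total;
        size_t n = fread(buf + total, 1, unused, fp);
        total += n;
        if (n < unused)                     /* end-of-file or error */
            break;
    }
    if (ferror(fp)) { free(buf); return NULL; }

    /* Resize to the bytes read plus room for the null character. */
    char *fit = realloc(buf, total + 1);
    if (fit == NULL) { free(buf); return NULL; }
    fit[total] = '\0';
    *out_len = total;
    return fit;
}
```

Because the length is returned separately, a caller can still detect embedded null characters by comparing `strlen()` of the result with `*out_len`.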

Alternatively, reconsider whether the entire file needs to be read in before processing the data. I do not know the higher-level goal, but processing data as it arrives often has less resource impact and gives faster throughput.
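
As an illustration of processing data as it arrives, the sketch below counts line endings while reusing one fixed-size buffer, so memory use does not grow with the file size. The task (counting '\n') and the name `count_lines` are made up for the example.

```c
#include <stdio.h>

/* Hypothetical example: count '\n' characters using a fixed-size buffer
 * that is overwritten by each fread(). Returns -1 on a read error. */
long count_lines(FILE *fp)
{
    char chunk[4096];
    long lines = 0;
    size_t n;

    /* fread() returns 0 once nothing more can be read. */
    while ((n = fread(chunk, 1, sizeof chunk, fp)) > 0) {
        for (size_t i = 0; i < n; i++)
            if (chunk[i] == '\n')
                lines++;
    }
    return ferror(fp) ? -1 : lines;
}
```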


Advanced

Text files may be simple ASCII only, defined by an 8-bit code page, or in one of various UTF encodings (with or without a byte-order-mark, etc.). The last line may or may not end with a '\n'. Robust text processing beyond simple ASCII is non-trivial.

ASCII and UTF-8 are the most common. IMO, handle 1 or both of those and error out on anything that does not meet their requirements.
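
A rough sketch of the "error out" part for UTF-8 follows; pure ASCII passes as a subset. The function name `is_valid_utf8` is hypothetical. It rejects stray continuation bytes, truncated sequences, overlong encodings, surrogates and code points above U+10FFFF.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if s[0..len) is well-formed UTF-8. */
bool is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = s[i];
        size_t extra;           /* continuation bytes expected */
        unsigned long cp, min;  /* code point, smallest value for this length */

        if (c < 0x80)                { i++; continue; }
        else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; min = 0x10000; }
        else return false;      /* stray continuation or invalid lead byte */

        if (len - i - 1 < extra)
            return false;       /* sequence truncated at end of data */
        for (size_t k = 1; k <= extra; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return false;   /* expected a continuation byte */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF)
            return false;       /* overlong encoding or out of range */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return false;       /* surrogates are not valid in UTF-8 */
        i += extra + 1;
    }
    return true;
}
```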


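*1 fread() reads elements whose size in bytes is given by the 2nd argument and whose count is given by the 3rd. In OP's case that is 1 element of filesize bytes, so the return value is 0 or 1 rather than a byte count.

//             v ------------- size of each element in bytes
//                       v --- number of elements
fread(content, filesize, 1, fp);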
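
To see how the 2nd and 3rd arguments affect the return value, here is a small sketch. `items_read` is a hypothetical helper that rewinds the stream and reports what fread() returns for a given element size and count.

```c
#include <stdio.h>

/* Hypothetical helper: rewind fp, then return what fread() reports
 * for `count` elements of `size` bytes each. */
size_t items_read(FILE *fp, size_t size, size_t count)
{
    char buf[64];
    if (size == 0 || count > sizeof buf / size)
        return 0;               /* request would not fit in buf */
    rewind(fp);
    /* The return value counts complete elements, not bytes. */
    return fread(buf, size, count, fp);
}
```

For a file holding 5 bytes, `items_read(fp, 8, 1)` yields 0 because the single 8-byte element is incomplete, while `items_read(fp, 1, 8)` yields 5. Passing 1 as the element size is therefore the form that reveals how many bytes were actually read.
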
Andreas Wenzel
chux - Reinstate Monica
  • 1
    It is worth noting that in plain ISO C, there is no solution to the problem of `length <= LONG_MAX`. In order to solve this problem, platform-specific functionality is required, for example the function [_ftelli64](https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/ftell-ftelli64?view=msvc-170) on Microsoft Windows. On 64-bit Linux, this is not a problem, as `sizeof(long) == 8`, whereas on Microsoft Windows, it is a problem, as `sizeof(long) == 4`. – Andreas Wenzel Apr 24 '22 at 16:25
  • @AndreasWenzel "there is no solution to the problem of length <= LONG_MAX" --> Agreed. As suggested above, do not use `ftell()` or like - alternatives to determine a file's size, but keep reading until end-of-file. – chux - Reinstate Monica Apr 24 '22 at 16:29
  • Thank you for the excellent answer! I still had some confusion regarding overflow. I read the fread() does not distinguish between chars/ encoding/ etc. But, whatever the file type may be, if it is too large, won't there be an overflow unless we are reading the stream in chunk fashion such that we replace the same array with next piece of information? unless C automatically goes kernel mode and is using virtual paging or some 'temp store - then procure required section on demand' technique? – mindoverflow Apr 30 '22 at 03:29
  • Why is “`fseek()` to the end” “technical _undefined behavior_ in _binary_ mode”? Did you mean text mode? – Paulo1205 Oct 29 '22 at 08:57
  • @Paulo1205 “`fseek()` to the end” is a problem due to C spec: "A binary stream need not meaningfully support fseek calls with a whence value of SEEK_END." § 7.21.9.2 3. Some historic binary files lacked any true length that was not a multiple of file sector size. In effect the file was padded with junk. This is not seen these days. – chux - Reinstate Monica Nov 01 '22 at 14:42