7

I am making a program that reads disk images in C. I am trying to make something portable, so I do not want to use too many OS-specific libraries. I am aware there are many disk images that are very large files but I am unsure how to support these files.

I have read up on fseek and it seems to use a long int, which is not guaranteed to support values over 2^31-1. fsetpos seems to support a larger value with fpos_t, but an absolute position cannot be specified. I have also thought about using several relative seeks with fseek but am unsure if this is portable.

How can I portably support large files in C?

Adam
  • 2
    Break the file into smaller files and do a seek on those? – Mox Jun 05 '15 at 03:42
  • Can you be more specific? It sounds like that would have a huge disk I/O cost. – ZeroKelvinKeyboard Jun 05 '15 at 03:43
  • 2
    Relative seeks. You can get to any offset using multiple relative seeks, and as long as you maintain a current position for a file, there's no need for absolute seeks. – Carey Gregory Jun 05 '15 at 03:50
  • I have just googled around. It seems that there are other similar questions. May I suggest you take a look at this one? http://stackoverflow.com/questions/11790822/reading-a-large-file-using-c-greater-than-4gb-using-read-function-causing-pro – Mox Jun 05 '15 at 03:51
  • @Mox The question you linked to is about POSIX, I wanted to use the standard library. – ZeroKelvinKeyboard Jun 05 '15 at 04:00
  • For `SEEK_CUR` you can add `fgetpos` to your desired offset and use the result to `fsetpos`. For `SEEK_END` you'll have to do the same, but with the result of `fstat`. – rslemos Jun 05 '15 at 04:04
  • @CareyGregory I have not found anything prohibiting it in the standard. It seems like the best option now. – ZeroKelvinKeyboard Jun 05 '15 at 04:15
  • And you can use unbuffered low-level I/O, `open()`/`read()`, without these restrictions. Also, why do you need `fseek()`? Can you please post an example of your code? – Iharob Al Asimi Jun 05 '15 at 04:16
  • Some systems (Linux) have `lseek64()` and `fseeko64()` calls. You have to define the `_LARGEFILE64_SOURCE` macro. – Severin Pappadeux Jun 05 '15 at 04:30
  • @iharob stop making edits that make correct expressions incorrect. It's bad enough you did it to the one answer, but here the OP even had parentheses and your edit completely disregarded them. If you want to fix the formatting, great, but don't intentionally introduce errors. – Adam Jun 05 '15 at 04:36
  • @Adam I am really sorry, I won't do another edit unless I am sure. – Iharob Al Asimi Jun 05 '15 at 04:40
  • http://stackoverflow.com/q/14533836/995714 http://stackoverflow.com/q/1035657/995714 – phuclv Jun 05 '15 at 05:35
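The relative-seek idea raised in the comments above can be sketched roughly as follows. This is only an illustration: `seek_abs_chunked` is a hypothetical helper name, and whether an implementation actually lets the stream position move past `LONG_MAX` this way is platform-dependent, so every `fseek` call is checked.

```c
#include <limits.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: move to absolute offset `target` using only fseek().
 * Returns 0 on success, -1 as soon as any fseek() call fails. */
static int seek_abs_chunked(FILE *fp, uint64_t target)
{
    /* Start from a known position: the beginning of the file. */
    if (fseek(fp, 0L, SEEK_SET) != 0)
        return -1;

    /* Advance in steps no larger than LONG_MAX until the target is reached. */
    while (target > 0) {
        long step = target > (uint64_t)LONG_MAX ? LONG_MAX : (long)target;
        if (fseek(fp, step, SEEK_CUR) != 0)
            return -1; /* the platform refused to go further */
        target -= (uint64_t)step;
    }
    return 0;
}
```

A caller that uses this has to keep its own 64-bit record of the current position, since `ftell` still reports a `long`.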

3 Answers

7

There is no portable way.

On Linux there is the fseeko() and ftello() pair (they need some defines; check the ftello() man page).

On Windows, I believe you have to use _fseeki64() and _ftelli64().

#ifdef is your friend
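A minimal sketch of what such an `#ifdef` wrapper could look like. The names `img_seek` and `img_tell` are made up for illustration, and the sketch assumes that `_WIN32` identifies the Microsoft runtime and that everything else is a POSIX system providing `fseeko`/`ftello`.

```c
/* These feature macros must come before the first system header. */
#if !defined(_WIN32)
#  ifndef _FILE_OFFSET_BITS
#    define _FILE_OFFSET_BITS 64    /* make off_t 64 bits on 32-bit systems */
#  endif
#  ifndef _POSIX_C_SOURCE
#    define _POSIX_C_SOURCE 200809L /* expose fseeko()/ftello() */
#  endif
#endif

#include <stdint.h>
#include <stdio.h>

#if defined(_WIN32)
/* Microsoft C runtime: 64-bit variants of fseek/ftell. */
static int img_seek(FILE *fp, int64_t off, int whence)
{
    return _fseeki64(fp, off, whence);
}
static int64_t img_tell(FILE *fp)
{
    return _ftelli64(fp);
}
#else
#include <sys/types.h> /* off_t */
/* POSIX: fseeko/ftello take and return off_t instead of long. */
static int img_seek(FILE *fp, int64_t off, int whence)
{
    return fseeko(fp, (off_t)off, whence);
}
static int64_t img_tell(FILE *fp)
{
    return (int64_t)ftello(fp);
}
#endif
```

The rest of the program would then call only `img_seek` and `img_tell`, which keeps the platform-specific part in one place.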

Iharob Al Asimi
Severin Pappadeux
6

pread() works on any POSIX-compliant platform (OS X, Linux, BSD, etc.). It's missing on Windows but there are lots of standard things that Windows gets wrong; this won't be the only thing in your codebase that needs a Windows special case.
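As a rough sketch of how `pread()` might be used for this, assuming a POSIX system: `read_at` is a hypothetical helper, and the `_FILE_OFFSET_BITS` define is there so that `off_t` is 64 bits wide on 32-bit platforms.

```c
/* Feature macros must come before the first system header. */
#ifndef _FILE_OFFSET_BITS
#  define _FILE_OFFSET_BITS 64    /* 64-bit off_t on 32-bit systems */
#endif
#ifndef _POSIX_C_SOURCE
#  define _POSIX_C_SOURCE 200809L /* expose pread() */
#endif

#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read exactly `len` bytes starting at absolute
 * offset `off`. Returns 0 on success, -1 on error or premature EOF. */
static int read_at(int fd, void *buf, size_t len, int64_t off)
{
    unsigned char *p = buf;

    while (len > 0) {
        ssize_t n = pread(fd, p, len, (off_t)off);
        if (n < 0) {
            if (errno == EINTR)
                continue;  /* interrupted by a signal: retry */
            return -1;
        }
        if (n == 0)
            return -1;     /* hit end of file before `len` bytes */
        p += n;
        off += n;
        len -= (size_t)n;
    }
    return 0;
}
```

Because `pread()` takes the offset as an argument, no separate seek call is needed and the file descriptor's own position is left untouched.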

Jonathan Leffler
StilesCrisis
2

You can't do it with standard C. Even with relative seeks it's not possible on some architectures.

One approach would be to check the platform at compile time. You can just check the value of LONG_MAX and throw a compile error if it's not large enough. But even that doesn't guarantee that the underlying filesystem supports files larger than 2 or 4GB.

A better way is to use the pre-processor macros supplied by your compiler to check the operating system that your code is being compiled for and write operating-system-specific code. The operating system should provide a way to check that the filesystem actually supports files larger than 2GB or 4GB.
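The LONG_MAX check could look something like this. It is only a sketch; the threshold 0x7FFFFFFF is 2^31-1, the smallest LONG_MAX the C standard permits.

```c
#include <limits.h>

/* Refuse to build where long cannot represent offsets beyond 2^31 - 1. */
#if LONG_MAX <= 0x7FFFFFFFL
#error "long is only 32 bits wide here; fseek()/ftell() cannot address large files"
#endif
```

Note that this rejects 64-bit Windows as well, because long is 32 bits wide there.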

fernan
  • Please provide citations from the standard showing that relative seeks might not work correctly. – autistic Jun 05 '15 at 04:32
  • 2
    Note that POSIX requires fseek to fail with EOVERFLOW when "the resulting file offset would be a value which cannot be represented correctly in an object of type long" - see http://pubs.opengroup.org/onlinepubs/9699919799/functions/fseek.html – StilesCrisis Jun 05 '15 at 05:04
  • @undefined behaviour, I changed the word architecture to platform on my answer. It's not the C standard that may prevent it from working. The underlying platform or filesystem may not support it, so the only reliable way is to check that it is supported and for that you need platform specific code. – fernan Jun 05 '15 at 05:23
  • @fernan Then the underlying platform would not be standard C compliant. – autistic Jun 05 '15 at 05:29
  • @StilesCrisis So then POSIX is non-compliant, and technically speaking POSIX C is not C..? See http://www.iso-9899.info/n1570.html#7.21.9.2 – autistic Jun 05 '15 at 05:31
  • @undefinedbehaviour: that link doesn't enumerate possible error conditions at all, or what would cause a call to `fseek` to be unable to be satisfied. I don't see anything here which makes POSIX non-conforming. – StilesCrisis Jun 05 '15 at 05:34
  • You're right, it doesn't list any error codes at all, or explicitly call out every case where `fseek` is required to succeed or required to fail. What's your point? The ISO spec isn't exhaustive. – StilesCrisis Jun 05 '15 at 05:36
  • @StilesCrisis It seems that POSIX C violates the C standard in a major way by allowing fseek to fail when "the resulting file offset would be a value which cannot be represented correctly in an object of type long", as you have mentioned. This seems quite intentional, yet at the start of the POSIX manual there's a disclaimer saying "Any conflict between the requirements described here and the ISO C standard is unintentional. This volume of POSIX.1-2008 defers to the ISO C standard." – autistic Jun 05 '15 at 05:37
  • I don't interpret it that way, but you can file a defect report. – StilesCrisis Jun 05 '15 at 05:37
  • 1
    @undefined behaviour, also I can't find where the standard defines size limits for fpos_t, but the signature of ftell() implies that the position of the stream needs to fit in a long, and the minimum value of LONG_MAX is (2^31)-1, so the standard doesn't say it's supported either. – fernan Jun 05 '15 at 05:40
  • @StilesCrisis It wouldn't be the first time. There's a subtle inconsistency in the fscanf return value, too. – autistic Jun 05 '15 at 05:43
  • "Then the underlying platform would not be standard C compliant." -- if that's the case then there's no such thing as a C-compliant platform, since the most common filesystem (FAT) supports up to 4GB files and pretty much every platform supports it. – fernan Jun 05 '15 at 05:46
  • @fernan Right, implying that `ftell` on a stream that has seeked beyond `LONG_MAX` is either implementation-defined behaviour (due to returning the result of a conversion from a wider integer type) or undefined behaviour (due to incrementing a long beyond `LONG_MAX`). Many other functions can cause UB... – autistic Jun 05 '15 at 05:46
  • @undefined behaviour, it means that support for large files is not mandatory and relative seeking beyond it is undefined, normally you get a non-specific error, POSIX fixed this by adding a specific error in that case which doesn't break compliance with the standard, it only extends it. Look at the source code of the common open source standard libraries. On newlib for example, large file support is enabled by the __LARGE_FILES64 define. – fernan Jun 05 '15 at 06:06
  • @fernan The standard documents the ways in which fseek may fail: "If a read or write error occurs, the error indicator for the stream is set and fseek fails." ... "A binary stream need not meaningfully support fseek calls with a whence value of SEEK_END." – autistic Jun 05 '15 at 06:09
  • @fernan The C standard may require that that argument be a long, meaning you can only seek LONG_MIN or LONG_MAX bytes *in one seek*, however it doesn't require that the internal integer type used to add the value corresponding to whence to the offset be long... There's nothing in the C standard demonstrating a failure mode for multiple seeks, except perhaps *read or write error*, which this *overflow* doesn't constitute. – autistic Jun 05 '15 at 06:11
  • @undefinedbehaviour, the fact is that relative seeks may fail on some platforms/architectures that are standard-compliant. So the standard has not been implemented according to your interpretation. Other standards, like POSIX, have extended it to remove this ambiguity without breaking compliance. – fernan Jun 05 '15 at 06:20
  • @fernan I imagine you already know this, but here it is anyway: `printf` returns an `int` representing either a failure mode (negative) or a positive value indicating how many bytes were written. The problem here is that an object may be as large as `SIZE_MAX`, and there's no guarantee that this value will be less than or equal to `INT_MAX`... In fact, for my implementation `SIZE_MAX` is over `2 * INT_MAX`... Thus, if I try to print a string that is over `INT_MAX` bytes in length, it's perfectly compliant for `printf` to return a negative value, despite success? – autistic Jun 05 '15 at 06:38
  • @undefinedbehaviour since the standard says that printf must return the number of characters transmitted as an int I take it as implying that a standard printf must not print more than INT_MAX. It looks like implementations agree with that and actually set saner limits, I tried to fprintf() a 4GB string on amd64 with glibc and it stopped at 1GB returning the correct number of chars written, so I guess an error is when nothing was written. – fernan Jun 05 '15 at 08:20
  • @fernan Then it's okay for the implementation to stop printing the string prior to the end of the string? That seems like another violation... Would you mind showing me your code? – autistic Jun 05 '15 at 08:26