
I'm writing a program that downloads several files at once from several different servers (one download thread per server, of course!). I'm worried that having multiple files growing on disk simultaneously will cause disk fragmentation, and I'd like to mitigate that by preallocating space on disk for the file's full length (as reported by the Content-Length header) before starting the download, ideally without increasing the file's apparent length (so I can resume a failed download just by reopening the partially downloaded file in append mode).

Is that possible in a platform-independent manner?
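
For concreteness, the per-server download loop I have in mind looks roughly like this (using `requests` purely for illustration, and glossing over error handling and servers that ignore `Range`); the marked comment is the part I don't know how to do:

    import os
    import requests  # illustration only; any HTTP client would do

    def download(url, path):
        # Resume support: if a previous attempt left a partial file behind,
        # ask the server for the remaining bytes and append to it.
        already_have = os.path.getsize(path) if os.path.exists(path) else 0
        headers = {'Range': 'bytes=%d-' % already_have} if already_have else {}
        with requests.get(url, headers=headers, stream=True) as response:
            response.raise_for_status()
            total = already_have + int(response.headers['Content-Length'])
            with open(path, 'ab') as file:
                # <-- here I'd like to preallocate `total` bytes on disk
                #     without changing the file's apparent length
                for chunk in response.iter_content(chunk_size=1 << 16):
                    file.write(chunk)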

wallefan

2 Answers


I did a bit of googling and found this lovely article with some C code to do exactly what you're asking on Windows. Here's that C code translated to ctypes (written for readability):

    import ctypes
    import ctypes.wintypes
    import msvcrt
    # https://learn.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-setfileinformationbyhandle
    kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)
    set_file_information = kernel32.SetFileInformationByHandle

    class AllocationInfo(ctypes.Structure):
        # Mirrors the Win32 FILE_ALLOCATION_INFO structure
        _fields_ = [('AllocationSize', ctypes.wintypes.LARGE_INTEGER)]

    def allocate(file, length):
        """Tell the filesystem to preallocate `length` bytes on disk for the specified `file` without increasing the
        file's length.
        In other words, advise the filesystem that you intend to write at least `length` bytes to the file.
        """
        allocation_info = AllocationInfo(length)
        retval = set_file_information(ctypes.wintypes.HANDLE(msvcrt.get_osfhandle(file.fileno())),
                                      5,  # FileAllocationInfo in the FILE_INFO_BY_HANDLE_CLASS enum
                                      ctypes.byref(allocation_info),
                                      ctypes.sizeof(allocation_info)
                                      )
        if not retval:  # BOOL return: zero means the call failed
            raise ctypes.WinError(ctypes.get_last_error())

This will change the file's "Size on disk" (as shown in File Explorer) to the length you specify, plus a few kilobytes for metadata, but leave its "Size" unchanged.
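
A rough usage sketch (the download part is just a placeholder; only the `allocate()` call above matters here):

    # Hypothetical usage: reserve the space up front, then stream the
    # response body into the file. `content_length` would come from the
    # server's Content-Length header.
    content_length = 10 * 1024 * 1024
    with open('download.part', 'wb') as file:
        allocate(file, content_length)
        # ... write the downloaded data into `file` here ...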

However, in the half hour I've spent googling, I've not found a way to do that on POSIX. fallocate() actually does the exact opposite of what you're after: it sets the file's apparent length to the length you give it, but allocates it as a sparse extent on the disk, so writing to multiple files simultaneously will still result in fragmentation. Ironic, isn't it, that Windows has a file management feature that POSIX lacks?

I'd love nothing more than to be proven wrong, but I don't think it's possible.

wallefan
  • Still a valuable answer, but yes, I am curious how this could be done on posix – juanpa.arrivillaga Aug 06 '20 at 02:00
  • @juanpa.arrivillaga In POSIX, you'd use `posix_fallocate`. But since this is Python, "simple is better than complex" :-) – arunanshub Jun 23 '21 at 17:25
  • @arunanshub -- What fallocate() does (and what your answer below does) is create a sparse extent, which is basically a filesystem's way of saying "there are a bunch of zeros here, no need to actually store them". But I do want it to store them -- I want to create a contiguous file on disk with a given size -- because I'm about to write data, and 1) I don't want to have to wait for the FS to decide where to put it, and 2) if I'm writing multiple files at once, I want them to be sequential on disk rather than interleaved so that when I read them one after the other they will load faster. – wallefan Jul 02 '21 at 07:23
  • @wallefan For that (sequential access) you can use `fadvise` or `madvise` (eg `POSIX_FADV_SEQUENTIAL`). `madvise` is recommended as it creates a memory mapping. And *you need to open a file before using fallocate and friends, otherwise EBADF is returned*. For sequential access in python, use `mmap.MADV_SEQUENTIAL`. – arunanshub Jul 03 '21 at 09:48
    FILENAME = "somefile.bin"
    SIZE = 4200000

    with open(FILENAME, "wb") as file:
        # Seek just past where the last byte should go and write a single
        # byte; this extends the file's apparent length to SIZE bytes.
        file.seek(SIZE - 1)
        file.write(b"\0")
Advantages:

  1. Portable across all platforms.
  2. Very efficient if you'll be mmap-ing (memory-mapping) the files to write to them (via MADV_SEQUENTIAL if sequential access is needed; see the sketch below).
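
A minimal sketch of that mmap route, assuming a POSIX system and Python 3.8+ (for `mmap.MADV_SEQUENTIAL`) and reusing the file created above:

    import mmap

    FILENAME = "somefile.bin"
    SIZE = 4200000

    with open(FILENAME, "r+b") as file:
        with mmap.mmap(file.fileno(), SIZE) as mapped:
            # Advise the kernel that access will be sequential.
            mapped.madvise(mmap.MADV_SEQUENTIAL)
            mapped[:5] = b"hello"  # writes land directly in the file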
arunanshub