I'm writing a program that uses dynamic programming to solve a difficult problem. The DP solution requires storing a large table. The full table occupies approximately 300 GB and is physically stored in 40 files of about 7 GB each. I mark unused table entries with the byte \xFF. I'd like to allocate space for this table quickly. The program has to run under both Windows and Linux.

In short, I want to efficiently create large files filled with a specific byte in a cross-platform manner.

Here is the code I'm currently using:

def reset_storage(self, path):
    fill = b'\xFF'

    with open(path, 'wb') as f:
        for _ in range(3715948544 * 2):
            f.write(fill)

It takes about 40 minutes to create one 7 GB file. How do I speed it up?

I've taken a look at other questions, but none of them seem to be relevant.

Pastafarianist

2 Answers

Write blocks, not bytes, and avoid iterating huge ranges for no reason.

import itertools

def reset_storage(self, path):
    total = 3715948544 * 2
    block_size = 4096  # Tune this if needed, just make sure it's a factor of the total
    fill = b'\xFF' * block_size

    with open(path, 'wb') as f:
        f.writelines(itertools.repeat(fill, total // block_size))
        # If you want to handle initialization of arbitrary totals without
        # needing to be careful that block_size evenly divides total, add
        # a single:
        # f.write(fill[:total % block_size])
        # here to write out the incomplete block.

Ideal block size is going to differ from system to system. One reasonable choice would be to use io.DEFAULT_BUFFER_SIZE to match writes to flushes automatically, while still keeping memory usage low.
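As a rough sketch of that suggestion, the function below uses io.DEFAULT_BUFFER_SIZE as the block size and writes the incomplete final block explicitly, so the total does not have to be a multiple of the block size. The name fill_file and its parameters are made up for illustration, and the block size is only a starting point to tune on each system.

```python
import io
import itertools


def fill_file(path, total, fill_byte=b'\xFF', block_size=io.DEFAULT_BUFFER_SIZE):
    """Create `path` containing exactly `total` copies of `fill_byte`.

    Hypothetical helper: the name and signature are illustrative only.
    """
    block = fill_byte * block_size
    full_blocks, remainder = divmod(total, block_size)

    with open(path, 'wb') as f:
        # One write per block instead of one write per byte.
        f.writelines(itertools.repeat(block, full_blocks))
        # Write the incomplete trailing block, if there is one.
        if remainder:
            f.write(block[:remainder])
```

Calling fill_file(path, 3715948544 * 2) would produce a file of the same size as the one in the question.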

ShadowRanger

Your problem is calling Python methods way too often (once for each byte!). What I offer is surely not perfect, but it will work many, many times faster. Try the following:

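fill = b"\xFF" * 1024 * 1024  # 1 MiB of 0xFF bytes, built once
...
file_size = 300 * 1024  # size in MiB; 300 * 1024 MiB is the full 300 GiB table, not one ~7 GB file
with open(path, 'wb') as f:
    for _ in range(file_size):
        f.write(fill)  # one call per MiB instead of one call per byte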
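
To see the per-call overhead for yourself, something like the following could be run on a small file first. This is only a sketch: write_per_byte, write_in_chunks and timed are hypothetical helper names, the 4 MiB demo size is an arbitrary choice, and the timings will vary by machine.

```python
import os
import tempfile
import time


def write_per_byte(path, total):
    """The slow pattern: one f.write() call for every byte."""
    fill = b'\xFF'
    with open(path, 'wb') as f:
        for _ in range(total):
            f.write(fill)


def write_in_chunks(path, total, chunk_size=1024 * 1024):
    """The faster pattern: one f.write() call per chunk."""
    chunk = b'\xFF' * chunk_size
    with open(path, 'wb') as f:
        for _ in range(total // chunk_size):
            f.write(chunk)
        # Leftover bytes when total is not a multiple of chunk_size.
        f.write(chunk[:total % chunk_size])


def timed(func, path, total):
    """Return the wall-clock seconds func(path, total) takes."""
    start = time.perf_counter()
    func(path, total)
    return time.perf_counter() - start


if __name__ == '__main__':
    demo_total = 4 * 1024 * 1024  # small enough that the slow variant finishes quickly
    with tempfile.TemporaryDirectory() as tmp:
        demo_path = os.path.join(tmp, 'demo.bin')
        print('per byte :', timed(write_per_byte, demo_path, demo_total))
        print('per chunk:', timed(write_in_chunks, demo_path, demo_total))
```

Both functions produce identical files; only the number of Python-level write calls differs.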
Art