Expanding a temporary slice if more bytes are needed

Question

I'm generating random files programmatically in a directory, at least temporaryFilesTotalSize worth of random data (a bit more, who cares).

Here's my code:

var files []string

    for size := int64(0); size < temporaryFilesTotalSize; {
        fileName := random.HexString(12)
        filePath := dir + "/" + fileName
        file, err := os.Create(filePath)
        if err != nil {
            return nil, err
        }

        size += rand.Int63n(1 << 32) // random dimension up to 4GB
        raw := make([]byte, size)
        _, err = rand.Read(raw)
        if err != nil {
            panic(err)
        }

        file.Write(raw)
        file.Close()
        files = append(files, filePath)
    }

Is there any way I can avoid that raw := make([]byte, size) allocation in the for loop? Ideally I'd like to keep a slice on the heap and only grow if a bigger size is required. Any way to do this efficiently?

Why do you want to? Is the performance overhead of a _single_ memory allocation greater than that of file I/O on your target platform? — Thomas, Feb 03 '22 at 12:33
To reuse the `raw` slice, move it outside the for loop. Inside the loop slice it as you need: `tempRaw := raw[:size]`. Allocate a bigger if `len(raw) < size`. — icza, Feb 03 '22 at 12:52
Also note that if you're worried about performance, you can do this without using an intermediate buffer. `os.File` implements `io.ReaderFrom`, so you could create an `io.LimitedReader` from `rand.Rand` and pass that to `File.ReadFrom()`. — icza, Feb 03 '22 at 13:06
But as @Thomas noted: generating random data and writing that to disk is an order of magnitude slower (at least) than allocating a contiguous memory for buffer. — icza, Feb 03 '22 at 13:38
@icza `ReaderFrom` and `LimitReader` are excellent suggestions. Make it an answer and I'll accept it. It's a WAY better method to do this. — Dean, Feb 03 '22 at 16:36

icza · Accepted Answer · 2022-02-04T08:22:22.193

First of all, you should know that generating random data and writing it to disk is at least an order of magnitude slower than allocating a contiguous block of memory for a buffer. This definitely falls under the "premature optimization" category: eliminating the buffer allocation inside the loop will not make your code noticeably faster.

Reusing the buffer

But to reuse the buffer, move it outside of the loop, create the biggest needed buffer, and slice it in each iteration to the needed size. It's OK to do this, because we'll overwrite the whole part we need with random data.

Note that I changed the size generation somewhat. Your code likely has an error: because it uses the accumulated size as the size of each new file, every generated file is larger than the previous one.

Also note that writing a file whose contents are already prepared in a []byte is most easily done with a single call to os.WriteFile().

Something like this:

bigRaw := make([]byte, 1 << 32)

for totalSize := int64(0); totalSize < temporaryFilesTotalSize; {
    size := rand.Int63n(1 << 32) // random size up to 4 GB
    totalSize += size

    raw := bigRaw[:size]
    rand.Read(raw) // It's documented that rand.Read() always returns nil error

    filePath := filepath.Join(dir, random.HexString(12))
    if err := os.WriteFile(filePath, raw, 0666); err != nil {
        panic(err)
    }

    files = append(files, filePath)
}
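
If the largest size is not known up front, the grow-only approach suggested in the comments could look roughly like the sketch below. The scratch type and its get method are hypothetical names introduced only for illustration, and the sizes are kept tiny so the example runs instantly.

```go
package main

import (
	"fmt"
	"math/rand"
)

// scratch is a hypothetical helper: a reusable buffer that only ever grows.
type scratch struct {
	buf []byte
}

// get returns a slice of exactly n bytes. It allocates a new backing array
// only when the current one is too small; otherwise it reslices the old one.
func (s *scratch) get(n int) []byte {
	if cap(s.buf) < n {
		s.buf = make([]byte, n)
	}
	return s.buf[:n]
}

func main() {
	r := rand.New(rand.NewSource(1))

	var s scratch
	for _, n := range []int{16, 8, 64, 32} {
		raw := s.get(n)
		r.Read(raw) // overwrites the whole slice, so stale bytes do not matter
		fmt.Printf("requested %d bytes, backing capacity %d\n", n, cap(raw))
	}
}
```

The old contents are never copied on growth, which is fine here because every returned slice is fully overwritten with random data before it is used.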

Solving the task without an intermediate buffer

Since you are writing big files (GBs), allocating that big buffer is not a good idea: running the app will require GBs of RAM! We could improve it with an inner loop that reuses a smaller buffer until the expected size has been written, which solves the memory issue but increases complexity. Luckily for us, we can solve the task without managing any buffer ourselves, and even with decreased complexity!
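
For comparison, the inner-loop variant mentioned above might be sketched as follows. The names writeRandom, countingWriter and rng are made up for this illustration, and countingWriter merely stands in for a file so the example does not touch the disk.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"math/rand"
)

// rng is an assumed package-level source of random bytes for this sketch.
var rng = rand.New(rand.NewSource(1))

// countingWriter is a hypothetical stand-in for a file: it only counts bytes.
type countingWriter struct{ n int64 }

func (c *countingWriter) Write(p []byte) (int, error) {
	c.n += int64(len(p))
	return len(p), nil
}

// writeRandom writes exactly size bytes read from r to w, reusing buf for
// every chunk so memory use is bounded by len(buf) rather than by size.
func writeRandom(w io.Writer, r io.Reader, size int64, buf []byte) error {
	if len(buf) == 0 {
		return errors.New("writeRandom: buffer must not be empty")
	}
	for size > 0 {
		chunk := buf
		if int64(len(chunk)) > size {
			chunk = chunk[:size] // last, shorter chunk
		}
		if _, err := io.ReadFull(r, chunk); err != nil {
			return err
		}
		if _, err := w.Write(chunk); err != nil {
			return err
		}
		size -= int64(len(chunk))
	}
	return nil
}

func main() {
	var dst countingWriter
	buf := make([]byte, 64<<10) // 64 KiB buffer, reused for every chunk
	if err := writeRandom(&dst, rng, 1<<20, buf); err != nil {
		panic(err)
	}
	fmt.Println("bytes written:", dst.n)
}
```

This works, but the bookkeeping for the last, shorter chunk is the extra complexity referred to above.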

We should somehow "channel" the random data from a rand.Rand directly to the file, similar to what io.Copy() does. Note that rand.Rand implements io.Reader, and os.File implements io.ReaderFrom, which suggests we could simply pass a rand.Rand to file.ReadFrom(), and the file itself would read the data to be written directly from the rand.Rand.

This sounds good, but ReadFrom() reads data from the given reader until EOF or an error occurs. Neither will ever happen if we pass a rand.Rand. And we do know how many bytes we want to be read and written: size.

To our "rescue" comes io.LimitReader(): we pass an io.Reader and a size to it, and the returned reader will supply no more than the given number of bytes, and after that will report EOF.
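
A small sketch of that behaviour, using io.Discard in place of a file: the endless stream from rand.Rand is cut off after exactly n bytes, so the copy terminates. The function name limitedRandomBytes is hypothetical and exists only for this illustration.

```go
package main

import (
	"fmt"
	"io"
	"math/rand"
)

// limitedRandomBytes drains a rand.Rand wrapped in io.LimitReader and
// returns how many bytes were produced before the reader reported EOF.
func limitedRandomBytes(seed, n int64) (int64, error) {
	r := rand.New(rand.NewSource(seed))
	// Without LimitReader this copy would never return, because
	// rand.Rand never reports EOF.
	return io.Copy(io.Discard, io.LimitReader(r, n))
}

func main() {
	n, err := limitedRandomBytes(1, 1000)
	if err != nil {
		panic(err)
	}
	fmt.Println("bytes read before EOF:", n)
}
```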

Note that creating our own rand.Rand will also be faster. The source we pass to it is created using rand.NewSource(), which returns an "unsynchronized" source: it is not safe for concurrent use, but it is faster for that reason. The source used by the default/global rand.Rand is synchronized, so it is safe for concurrent use but slower.

Perfect! Let's see this in action:

r := rand.New(rand.NewSource(time.Now().Unix()))

for totalSize := int64(0); totalSize < temporaryFilesTotalSize; {
    size := r.Int63n(1 << 32) // random size up to 4 GB
    totalSize += size

    filePath := filepath.Join(dir, random.HexString(12))
    file, err := os.Create(filePath)
    if err != nil {
        return nil, err
    }

    if _, err := file.ReadFrom(io.LimitReader(r, size)); err != nil {
        panic(err)
    }

    if err = file.Close(); err != nil {
        panic(err)
    }

    files = append(files, filePath)
}

Note that if os.File did not implement io.ReaderFrom, we could still use io.Copy(), providing the file as the destination and a limited reader (as used above) as the source.
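
A minimal sketch of that io.Copy() variant is shown below. The helper names writeRandomFile and demo, the file name sample.bin and the temporary directory prefix are all assumptions made for the example, and the size is kept tiny.

```go
package main

import (
	"fmt"
	"io"
	"math/rand"
	"os"
	"path/filepath"
)

// writeRandomFile creates path and fills it with size bytes read from r,
// using io.Copy instead of calling file.ReadFrom directly.
func writeRandomFile(path string, r io.Reader, size int64) (written int64, err error) {
	file, err := os.Create(path)
	if err != nil {
		return 0, err
	}
	written, err = io.Copy(file, io.LimitReader(r, size))
	// Report a Close error only if the copy itself succeeded.
	if cerr := file.Close(); err == nil {
		err = cerr
	}
	return written, err
}

// demo writes one small file into a throwaway directory and reports the
// number of bytes copied and the resulting size on disk.
func demo(size int64) (written, onDisk int64, err error) {
	dir, err := os.MkdirTemp("", "randfiles")
	if err != nil {
		return 0, 0, err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "sample.bin")
	r := rand.New(rand.NewSource(1))
	written, err = writeRandomFile(path, r, size)
	if err != nil {
		return written, 0, err
	}
	info, err := os.Stat(path)
	if err != nil {
		return written, 0, err
	}
	return written, info.Size(), nil
}

func main() {
	written, onDisk, err := demo(4096)
	if err != nil {
		panic(err)
	}
	fmt.Println("copied:", written, "on disk:", onDisk)
}
```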

Final note: closing the file (or any resource) is best done using defer, so it'll get called no matter what. Using defer in a loop is a bit tricky though, as deferred functions run at the end of the enclosing function, and not at the end of the loop's iteration. So you may wrap it in a function. For details, see `defer` in the loop - what will be better?
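
One way to wrap the loop body in a function so that defer runs once per iteration is sketched here. createRandomFiles, demo and the file-N.bin naming are hypothetical; the deferred closure also keeps a Close error from being silently dropped.

```go
package main

import (
	"fmt"
	"io"
	"math/rand"
	"os"
	"path/filepath"
)

// createRandomFiles writes one file per entry in sizes. The body of each
// iteration is an anonymous function, so the deferred Close runs when that
// iteration ends rather than when createRandomFiles returns.
func createRandomFiles(dir string, r *rand.Rand, sizes []int64) ([]string, error) {
	var files []string
	for i, size := range sizes {
		path := filepath.Join(dir, fmt.Sprintf("file-%d.bin", i))
		err := func() (err error) {
			file, err := os.Create(path)
			if err != nil {
				return err
			}
			defer func() {
				// Keep the first error; surface Close errors otherwise.
				if cerr := file.Close(); err == nil {
					err = cerr
				}
			}()
			_, err = file.ReadFrom(io.LimitReader(r, size))
			return err
		}()
		if err != nil {
			return files, err
		}
		files = append(files, path)
	}
	return files, nil
}

// demo creates three tiny files in a throwaway directory and reports how
// many were created and their combined size on disk.
func demo() (count int, total int64, err error) {
	dir, err := os.MkdirTemp("", "randfiles")
	if err != nil {
		return 0, 0, err
	}
	defer os.RemoveAll(dir)

	r := rand.New(rand.NewSource(1))
	files, err := createRandomFiles(dir, r, []int64{10, 20, 30})
	if err != nil {
		return 0, 0, err
	}
	for _, f := range files {
		info, err := os.Stat(f)
		if err != nil {
			return 0, 0, err
		}
		total += info.Size()
	}
	return len(files), total, nil
}

func main() {
	count, total, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Println("files:", count, "total bytes:", total)
}
```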
