32

I am working on an OS-independent file manager, and I am looking for the most efficient way to copy a file on Linux. Windows has a built-in function, CopyFileEx(), but from what I've seen, there is no such standard function for Linux. So I guess I will have to implement my own. The obvious way is fopen/fread/fwrite, but is there a better (faster) way of doing it? I must also be able to stop every once in a while so that I can update the "copied so far" count for the file progress menu.
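For concreteness, a minimal sketch of the kind of callback interface I have in mind (the names are hypothetical, modeled loosely on CopyFileEx's progress routine):

#include <stdio.h>

/* Hypothetical callback type, modeled loosely on CopyFileEx()'s
   progress routine; the copy loop would call it after each chunk. */
typedef void (*progress_cb)(long long copied, long long total);

static void show_progress(long long copied, long long total)
{
    printf("\r%lld / %lld bytes copied", copied, total);
    fflush(stdout);
}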

Jonathan Leffler
Radu
  • Maybe call system cp command? – Griwes Sep 18 '11 at 19:03
  • Too complicated, I'd have to parse the output of the command to look for errors, and not sure if it has a progress indicator. – Radu Sep 18 '11 at 19:05
  • 2
    Possible duplicate. Have a look at http://stackoverflow.com/questions/3680730/c-fileio-copy-vs-systemcp-file1-x-file2-x - I think it's what you're asking about. – Aleks G Sep 18 '11 at 19:05
  • How do you do the "copied so far" thing in Windows? – pmg Sep 18 '11 at 19:05
  • 1
    CopyFileEx has a callback function which is called every once in a while, and it updates the copied so far amount. – Radu Sep 18 '11 at 19:08
  • Possible duplicate of [How can I copy a file on Unix using C?](http://stackoverflow.com/questions/2180079/how-can-i-copy-a-file-on-unix-using-c) – Ciro Santilli OurBigBook.com May 08 '17 at 12:17
  • 1
    Also see the [`copy_file_range(2)`](http://man7.org/linux/man-pages/man2/copy_file_range.2.html) man page. The function requires `_GNU_SOURCE`. But the man page also says, *"The copy_file_range() system call first appeared in Linux 4.5, but glibc 2.27 provides a user-space emulation when it is not available."* – jww Mar 18 '19 at 10:58

6 Answers

39

Unfortunately, you cannot use sendfile() here on older kernels because the destination is not a socket. (The name sendfile() comes from send() + "file".) Update, per the comments below: since Linux 2.6.33, out_fd can be any file, so sendfile() does work for file-to-file copies on recent kernels.

For zero-copy, you can use splice() as suggested by @Dave. (Except it will not be zero-copy; it will be "one copy" from the source file's page cache to the destination file's page cache.)

However... (a) splice() is Linux-specific; and (b) you can almost certainly do just as well using portable interfaces, provided you use them correctly.

In short, use open() + read() + write() with a small temporary buffer. I suggest 8K. So your code would look something like this:

#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

int in_fd = open("source", O_RDONLY);
assert(in_fd >= 0);
int out_fd = open("dest", O_WRONLY | O_CREAT | O_TRUNC, 0644);
assert(out_fd >= 0);
char buf[8192];

while (1) {
    ssize_t read_result = read(in_fd, buf, sizeof(buf));
    if (read_result == 0)       /* end of file */
        break;
    assert(read_result > 0);    /* real code should check errno and retry on EINTR */
    ssize_t write_result = write(out_fd, buf, (size_t)read_result);
    assert(write_result == read_result);  /* real code should handle short writes */
}

With this loop, you will be copying 8K from the in_fd page cache into the CPU L1 cache, then writing it from the L1 cache into the out_fd page cache. Then you will overwrite that part of the L1 cache with the next 8K chunk from the file, and so on. The net result is that the data in buf will never actually be stored in main memory at all (except maybe once at the end); from the system RAM's point of view, this is just as good as using "zero-copy" splice(). Plus it is perfectly portable to any POSIX system.

Note that the small buffer is key here. Typical modern CPUs have 32K or so for the L1 data cache, so if you make the buffer too big, this approach will be slower. Possibly much, much slower. So keep the buffer in the "few kilobytes" range.

Of course, unless your disk subsystem is very very fast, memory bandwidth is probably not your limiting factor. So I would recommend posix_fadvise to let the kernel know what you are up to:

posix_fadvise(in_fd, 0, 0, POSIX_FADV_SEQUENTIAL);

This will give a hint to the Linux kernel that its read-ahead machinery should be very aggressive.

I would also suggest using posix_fallocate to preallocate the storage for the destination file. This will tell you ahead of time whether you will run out of disk. And for a modern kernel with a modern file system (like XFS), it will help to reduce fragmentation in the destination file.
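A minimal sketch of that preallocation step, assuming the in_fd/out_fd descriptors from the loop above (note that posix_fallocate returns the error number directly instead of setting errno):

#include <fcntl.h>
#include <sys/stat.h>

struct stat st;
int rc = fstat(in_fd, &st);
assert(rc == 0);
/* Reserve the full destination size up front; a nonzero return value
   (e.g. ENOSPC) tells you now that the copy cannot complete. */
int err = posix_fallocate(out_fd, 0, st.st_size);
assert(err == 0);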

The last thing I would recommend is mmap. It is usually the slowest approach of all thanks to TLB thrashing. (Very recent kernels with "transparent hugepages" might mitigate this; I have not tried recently. But it certainly used to be very bad. So I would only bother testing mmap if you have lots of time to benchmark and a very recent kernel.)

[Update]

There is some question in the comments about whether splice from one file to another is zero-copy. The Linux kernel developers call this "page stealing". Both the man page for splice and the comments in the kernel source say that the SPLICE_F_MOVE flag should provide this functionality.

Unfortunately, the support for SPLICE_F_MOVE was yanked in 2.6.21 (back in 2007) and never replaced. (The comments in the kernel sources never got updated.) If you search the kernel sources, you will find SPLICE_F_MOVE is not actually referenced anywhere. The last message I can find (from 2008) says it is "waiting for a replacement".

The bottom line is that splice from one file to another calls memcpy to move the data; it is not zero-copy. This is not much better than you can do in userspace using read/write with small buffers, so you might as well stick to the standard, portable interfaces.

If "page stealing" is ever added back into the Linux kernel, then the benefits of splice would be much greater. (And even today, when the destination is a socket, you get true zero-copy, making splice more attractive.) But for the purpose of this question, splice does not buy you very much.

Alnitak
Nemo
  • Thank you, very useful comment, especially the posix_fadvise and posix_fallocate. Are you sure that reading in small chunks is better than reading in larger chunks? The CPU cache is of course faster, but most of the time will be spent on the I/O, especially on mechanical devices, where if you read less than a cluster size at a time it might have to wait for the disk to spin around again before it can read the next piece. Of course, some OSes might be smarter and read and cache more than you actually tell them to read, but for an unknown OS I would expect reading in larger pieces to be better. – Radu Sep 18 '11 at 21:11
  • 1
    I expect any modern OS to read ahead of what you request. For example, I am certain Linux, Solaris, and the various BSDs do... So yes, if your disks are fast enough that memory is a bottleneck, I am quite certain that smaller blocks will be faster. If you are outrunning the disk anyway, then it does not matter. But large blocks will never be faster than small blocks unless your OS is ludicrously stupid. Optimize for the present and future, not the past :-) – Nemo Sep 18 '11 at 21:30
  • The MacOS X man page for `sendfile()` has the 'output must be a socket' information up top; the Linux man page seems to hide that information a long way down the page. – Jonathan Leffler Sep 18 '11 at 23:02
  • I'd be almost certain that larger blocks would be much better. Keeping the bits in the CPU's cache is probably far less important than minimizing system calls. – Gabe Sep 20 '11 at 01:33
  • 1
    @Gabe: Well, you would be wrong. I have actually benchmarked this. Linux system calls are more than fast enough to amortize over a few kilobytes so that the system call cost is undetectable. Cache and memory effects, on the other hand, are very detectable. I am not talking about a few percent; I am talking about _multiples_. Locality is everything on a modern system, and this becomes more true with every generation of CPU. – Nemo Sep 20 '11 at 04:14
  • @Gabe: It was a while ago and done internally for work. So you can believe me, not believe me, or try it yourself. Or you can just do the math. Linux system call overhead is [less than 0.1 microseconds](http://stackoverflow.com/questions/1860253/what-is-the-overhead-involved-in-a-mode-switch). Divided by 8K equals 1/80 of a nanosecond per byte. `memcpy` is maybe 2 gigabytes per second, or 1/2 nanosecond per byte. This is honestly not even a close call. – Nemo Sep 20 '11 at 04:35
  • 1
    @Gabe: Also, hammering main memory harms the performance of all of your cores, and these days you probably have more than one. So keeping things cache-friendly -- i.e., operating on small blocks -- is good not just for your thread but also for any other threads you might be running. This is borne out both by theory and by my experience. Of course, I could be lying about my experience. – Nemo Sep 20 '11 at 04:41
  • I'm not suggesting that you're lying, just that it's better to qualify your claims with numbers. It's not the system call overhead I'm worried about, it's the time spent actually executing the system calls. If you're writing a 100MB file 8kB at a time, you have to update the file's metadata 12,500 times. Not only does the inode need to be updated, but blocks have to be allocated and metadata has to be logged to the journal (if your filesystem has one). – Gabe Sep 20 '11 at 05:39
  • The drawback of this approach over splice is that your CPU is tied up shuffling bits around instead of being able to do other useful work. Which probably doesn't matter for the OP, but if this is code for a server of some kind then the distinction becomes important. – Eloff Jan 04 '13 at 17:35
  • @Eloff - `splice` is not actually zero-copy for file-to-file, either; it uses the CPU to move just as much data. It does it in half as many instructions, sure. But it is not zero instructions. On "a server of some kind", the memory hammering itself is where the contention will be, because you have lots of cores sharing the same RAM. In short, I seriously doubt you could find any real-world or even synthetic benchmark where you would notice the difference. – Nemo Jan 04 '13 at 18:46
  • @Nemo Where is this one copy? Correct me if I'm wrong, but on any decent system wouldn't the disk handle DMAing from the disk to kernel buffers and from kernel buffers to the disk? What about when there's a dedicated DMA engine (e.g. I/OAT)? So if you use splice between the two kernel buffers then there is no copy of note? By the way, you go too far when you say there's no synthetic or real world case where splice would have a performance advantage for copying from file to file. It will win in most any environment that is CPU bound. – Eloff Jan 05 '13 at 21:40
  • I should also add that if you have lots of cores running your copy in a server at the same time, you've exceeded the L1 cache and it will be a lot slower than splice and friends. Also hyper-threading and your thread being swapped out will trash the L1 cache, which may make your example slower as well. – Eloff Jan 06 '13 at 02:19
  • The current implementation of splice in the Linux kernel performs a copy from one page cache region to another when you splice from file to file. (Unlike file to socket -- i.e. sendfile -- which is true zero-copy). You would have a stronger argument if this were not the case. Check the kernel source code if you do not believe me; it is pretty easy to find. (If I have time, I will add a reference to this answer later.) – Nemo Jan 06 '13 at 03:05
  • 6
    Heads up - in 2.6.33 and later the out_fd can be any file. Check out the manpage. – ldrg Oct 17 '13 at 07:39
  • I am sorry for the question, but I want to learn. Is a check for EINTR not necessary? Why? – iwtu Jan 08 '15 at 15:13
  • @iwtu: Because my code is sloppy. (That is, you are right.) Also I am using `assert` wrong here. If I have time in the next few days I might update this answer... – Nemo Jan 08 '15 at 19:57
  • Regarding the lack of benchmarks in the comments above, the authors of coreutils `cp` have made one [here](http://git.savannah.gnu.org/cgit/coreutils.git/tree/src/ioblksize.h?id=c0a79542fb5c2c22cf0a250db94af6f8581ca342#n23). They found the optimal buffer value to be `128 KB`: `As of May 2014, 128KiB is determined to be the minimium blksize to best minimize system call overhead.` Note that this is for plain `read+write`; they don't try to use `splice` or `sendfile`, which might be faster. – nh2 Feb 09 '17 at 02:25
  • This code is so fragile it breaks as soon as someone builds in a non-debug configuration by defining `-DNDEBUG`. – jww Mar 18 '19 at 09:41
  • 1
    @jww I just edited the code to remove the code that has side effects from inside an `assert` call. It's still not "right", but at least it doesn't suffer that bug any more. – Alnitak Mar 18 '19 at 09:51
  • @jww Yes, I mentioned that two comments prior to yours and over four years ago. Never found time to clean up the code, but all of my points still stand. – Nemo Mar 19 '19 at 02:58
  • 2
    @Nemo please update the answer: https://man7.org/linux/man-pages/man2/sendfile.2.html `In Linux kernels before 2.6.33, out_fd must refer to a socket. Since Linux 2.6.33 it can be any file` – timotheecour Feb 13 '21 at 20:12
7

If you know they'll be using Linux > 2.6.17, splice() is the way to do zero-copy on Linux:

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

// Using NULL offsets and no flags for clarity; don't skip error handling in production.
int p[2];
pipe(p);
int out = open(OUTFILE, O_WRONLY | O_CREAT | O_TRUNC, 0644);
int in = open(INFILE, O_RDONLY);
ssize_t n;
while ((n = splice(in, NULL, p[1], NULL, 4096, 0)) > 0)
    splice(p[0], NULL, out, NULL, (size_t)n, 0);
Dave
  • Thanks for the suggestion, but I would like to make it as portable as possible. – Radu Sep 18 '11 at 19:36
  • 2
    @Radu - on recent Linux at least `splice()` is the right answer because file->file copies are no longer supported with `sendfile()` – Flexo Sep 18 '11 at 19:41
  • 3
    Neither `splice` nor `sendfile` are standardized. If performance is extremely important, write an optimized copy of the function for each environment, then fall back on `fread/frwite` from standard c. The glib function probably does something like that. – Dave Sep 18 '11 at 19:46
  • Hmm, so then I guess I will have to have 3 versions, one using open/read/write, one sendfile, and one splice? What kernel version stopped supporting sendfile() for file to file? – Radu Sep 18 '11 at 19:50
5

Use open/read/write — they avoid the libc-level buffering done by fopen and friends.

Alternatively, if you are using GLib, you could use its g_file_copy function.

Finally, an approach that may be faster, though it should be tested to be sure: use open and mmap to memory-map the input file, then write from the memory region to the output file. You'll probably want to keep open/read/write around as a fallback, since this method is limited by the address space size of your process.

Edit: the original answer suggested mapping both files; @bdonlan made an excellent suggestion in a comment to map only the source.
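A minimal sketch of that approach (error handling and the empty-file edge case elided; writing in chunks makes it easy to update a progress counter, as Gabe notes in the comments):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int in_fd = open("source", O_RDONLY);
struct stat st;
fstat(in_fd, &st);
char *src = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, in_fd, 0);
int out_fd = open("dest", O_WRONLY | O_CREAT | O_TRUNC, 0644);

/* Write in chunks so the caller can report progress between calls. */
for (off_t done = 0; done < st.st_size; ) {
    size_t chunk = (st.st_size - done > 65536) ? 65536 : (size_t)(st.st_size - done);
    ssize_t n = write(out_fd, src + done, chunk);
    if (n <= 0) break;          /* error handling elided */
    done += n;
}
munmap(src, st.st_size);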

Michael Ekstrand
  • Can you `mmap()` an empty file? – Jonathan Leffler Sep 18 '11 at 19:20
  • 1
    Thanks. I think using sendfile() as proposed here: http://stackoverflow.com/questions/3680730/c-fileio-copy-vs-systemcp-file1-x-file2-x would be better though, as it is done in the kernel. – Radu Sep 18 '11 at 19:29
  • Actually given the complications with `sendfile()` and the requirement for the output to be a socket it seems there is actually far more merit in the simple answer than I realised. – Flexo Sep 18 '11 at 19:46
  • @Jonathan: before copy you allocate empty space in the file, then `mmap`. – Karoly Horvath Sep 18 '11 at 20:31
  • 3
    I wouldn't mmap both files, myself - map the source, then `write()` from the mapped region. You've got to copy it either way, so might as well do it in the kernel and avoid the page faults on the destination. – bdonlan Sep 18 '11 at 20:38
  • @bdonlan Great idea. You can then also use twice as much address space to map the source file, allowing bigger files to be copied before fallback on 32-bit systems. I am updating my answer to reflect this. – Michael Ekstrand Sep 20 '11 at 01:21
  • 1
    Using `mmap` is a good idea, but you still have to do the writes in chunks to be able to show progress. – Gabe Sep 20 '11 at 01:34
1

My answer from a more recent duplicate of this post.

Boost now offers `mapped_file_source`, which portably models a memory-mapped file.

Maybe not as efficient as CopyFileEx() and splice(), but portable and succinct.

This program takes 2 filename arguments. It copies the first half of the source file to the destination file.

#include <boost/iostreams/device/mapped_file.hpp>
#include <iostream>
#include <fstream>
#include <cstdlib>

namespace iostreams = boost::iostreams;
int main(int argc, char** argv)
{
    if (argc != 3)
    {
        std::cerr << "usage: " << argv[0] << " <infile> <outfile> - copies half of the infile to outfile" << std::endl;
        std::exit(100);
    }

    auto source = iostreams::mapped_file_source(argv[1]);
    auto dest = std::ofstream(argv[2], std::ios::binary);
    dest.exceptions(std::ios::failbit | std::ios::badbit);
    auto first = source.begin();
    auto bytes = source.size() / 2;
    dest.write(first, bytes);
}

Depending on the OS, your mileage may vary with system calls such as splice and sendfile; however, note this caveat in the sendfile man page:

Applications may wish to fall back to read(2)/write(2) in the case where sendfile() fails with EINVAL or ENOSYS.
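A minimal sketch of that fallback pattern in C (assuming in_fd/out_fd are already open and count holds the number of bytes to copy; short-write and error handling abbreviated):

#include <errno.h>
#include <sys/sendfile.h>
#include <unistd.h>

/* Try the in-kernel copy first; fall back to a read/write loop if
   this kernel/filesystem combination does not support it. */
off_t off = 0;
ssize_t n = sendfile(out_fd, in_fd, &off, count);
if (n < 0 && (errno == EINVAL || errno == ENOSYS)) {
    char buf[8192];
    while ((n = read(in_fd, buf, sizeof(buf))) > 0)
        write(out_fd, buf, (size_t)n);  /* check for short writes in real code */
}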

Richard Hodges
0

I wrote some benchmarks to test this out and found copy_file_range to be the fastest. Otherwise, either use a 128 KiB buffer, or mmap the source read-only and use the write syscall for the destination.

Article: https://alexsaveau.dev/blog/performance/files/kernel/the-fastest-way-to-copy-a-file
Benchmarks: https://github.com/SUPERCILEX/fuc/blob/fb0ec728dbd323f351d05e1d338b8f669e0d5b5d/cpz/benches/copy_methods.rs
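For reference, a minimal C sketch of the copy_file_range(2) call itself (per the man page linked in the comments above; it needs _GNU_SOURCE and may copy fewer bytes than requested, hence the loop):

#define _GNU_SOURCE
#include <unistd.h>

/* Copy len bytes from in_fd to out_fd entirely in the kernel; on
   supporting filesystems this can even reflink instead of copying. */
static int copy_all(int in_fd, int out_fd, size_t len)
{
    while (len > 0) {
        ssize_t n = copy_file_range(in_fd, NULL, out_fd, NULL, len, 0);
        if (n <= 0)
            return -1;          /* error, or unexpected end of input */
        len -= (size_t)n;       /* file offsets advance automatically */
    }
    return 0;
}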


Benchmarks inlined in case that link goes down:

use std::{
    alloc,
    alloc::Layout,
    fs::{copy, File, OpenOptions},
    io::{BufRead, BufReader, Read, Write},
    os::unix::{fs::FileExt, io::AsRawFd},
    path::{Path, PathBuf},
    thread,
    time::Duration,
};

use cache_size::l1_cache_size;
use criterion::{
    criterion_group, criterion_main, measurement::WallTime, BatchSize, BenchmarkGroup, BenchmarkId,
    Criterion, Throughput,
};
use memmap2::{Mmap, MmapOptions};
use rand::{thread_rng, RngCore};
use tempfile::{tempdir, TempDir};

// Don't use an OS backed tempfile since it might change the performance characteristics of our copy
struct NormalTempFile {
    dir: TempDir,
    from: PathBuf,
    to: PathBuf,
}

impl NormalTempFile {
    fn create(bytes: usize, direct_io: bool) -> NormalTempFile {
        if direct_io && bytes % (1 << 12) != 0 {
            panic!("Num bytes ({}) must be divisible by 2^12", bytes);
        }

        let dir = tempdir().unwrap();
        let from = dir.path().join("from");

        let buf = create_random_buffer(bytes, direct_io);

        open_standard(&from, direct_io).write_all(&buf).unwrap();

        NormalTempFile {
            to: dir.path().join("to"),
            dir,
            from,
        }
    }
}

/// Doesn't use direct I/O, so files will be mem cached
fn with_memcache(c: &mut Criterion) {
    let mut group = c.benchmark_group("with_memcache");

    for num_bytes in [1 << 10, 1 << 20, 1 << 25] {
        add_benches(&mut group, num_bytes, false);
    }
}

/// Use direct I/O to create the file to be copied so it's not cached initially
fn initially_uncached(c: &mut Criterion) {
    let mut group = c.benchmark_group("initially_uncached");

    for num_bytes in [1 << 20] {
        add_benches(&mut group, num_bytes, true);
    }
}

fn empty_files(c: &mut Criterion) {
    let mut group = c.benchmark_group("empty_files");

    group.throughput(Throughput::Elements(1));

    group.bench_function("copy_file_range", |b| {
        b.iter_batched(
            || NormalTempFile::create(0, false),
            |files| {
                // Uses the copy_file_range syscall on Linux
                copy(files.from, files.to).unwrap();
                files.dir
            },
            BatchSize::LargeInput,
        )
    });

    group.bench_function("open", |b| {
        b.iter_batched(
            || NormalTempFile::create(0, false),
            |files| {
                File::create(files.to).unwrap();

                files.dir
            },
            BatchSize::LargeInput,
        )
    });

    #[cfg(target_os = "linux")]
    group.bench_function("mknod", |b| {
        b.iter_batched(
            || NormalTempFile::create(0, false),
            |files| {
                use nix::sys::stat::{mknod, Mode, SFlag};
                mknod(files.to.as_path(), SFlag::S_IFREG, Mode::empty(), 0).unwrap();

                files.dir
            },
            BatchSize::LargeInput,
        )
    });
}

fn just_writes(c: &mut Criterion) {
    let mut group = c.benchmark_group("just_writes");

    for num_bytes in [1 << 20] {
        group.throughput(Throughput::Bytes(num_bytes));

        group.bench_with_input(
            BenchmarkId::new("open_memcache", num_bytes),
            &num_bytes,
            |b, num_bytes| {
                b.iter_batched(
                    || {
                        let dir = tempdir().unwrap();
                        let buf = create_random_buffer(*num_bytes as usize, false);

                        (dir, buf)
                    },
                    |(dir, buf)| {
                        File::create(dir.path().join("file"))
                            .unwrap()
                            .write_all(&buf)
                            .unwrap();

                        (dir, buf)
                    },
                    BatchSize::PerIteration,
                )
            },
        );

        group.bench_with_input(
            BenchmarkId::new("open_nocache", num_bytes),
            &num_bytes,
            |b, num_bytes| {
                b.iter_batched(
                    || {
                        let dir = tempdir().unwrap();
                        let buf = create_random_buffer(*num_bytes as usize, true);

                        (dir, buf)
                    },
                    |(dir, buf)| {
                        let mut out = open_standard(dir.path().join("file").as_ref(), true);
                        out.set_len(*num_bytes).unwrap();

                        out.write_all(&buf).unwrap();

                        (dir, buf)
                    },
                    BatchSize::PerIteration,
                )
            },
        );
    }
}

fn add_benches(group: &mut BenchmarkGroup<WallTime>, num_bytes: u64, direct_io: bool) {
    group.throughput(Throughput::Bytes(num_bytes));

    group.bench_with_input(
        BenchmarkId::new("copy_file_range", num_bytes),
        &num_bytes,
        |b, num_bytes| {
            b.iter_batched(
                || NormalTempFile::create(*num_bytes as usize, direct_io),
                |files| {
                    // Uses the copy_file_range syscall on Linux
                    copy(files.from, files.to).unwrap();
                    files.dir
                },
                BatchSize::PerIteration,
            )
        },
    );

    group.bench_with_input(
        BenchmarkId::new("buffered", num_bytes),
        &num_bytes,
        |b, num_bytes| {
            b.iter_batched(
                || NormalTempFile::create(*num_bytes as usize, direct_io),
                |files| {
                    let reader = BufReader::new(File::open(files.from).unwrap());
                    write_from_buffer(files.to, reader);
                    files.dir
                },
                BatchSize::PerIteration,
            )
        },
    );

    group.bench_with_input(
        BenchmarkId::new("buffered_l1_tuned", num_bytes),
        &num_bytes,
        |b, num_bytes| {
            b.iter_batched(
                || NormalTempFile::create(*num_bytes as usize, direct_io),
                |files| {
                    let l1_cache_size = l1_cache_size().unwrap();
                    let reader =
                        BufReader::with_capacity(l1_cache_size, File::open(files.from).unwrap());

                    write_from_buffer(files.to, reader);

                    files.dir
                },
                BatchSize::PerIteration,
            )
        },
    );

    group.bench_with_input(
        BenchmarkId::new("buffered_readahead_tuned", num_bytes),
        &num_bytes,
        |b, num_bytes| {
            b.iter_batched(
                || NormalTempFile::create(*num_bytes as usize, direct_io),
                |files| {
                    let readahead_size = 1 << 17; // See https://eklitzke.org/efficient-file-copying-on-linux
                    let reader =
                        BufReader::with_capacity(readahead_size, File::open(files.from).unwrap());

                    write_from_buffer(files.to, reader);

                    files.dir
                },
                BatchSize::PerIteration,
            )
        },
    );

    group.bench_with_input(
        BenchmarkId::new("buffered_parallel", num_bytes),
        &num_bytes,
        |b, num_bytes| {
            b.iter_batched(
                || NormalTempFile::create(*num_bytes as usize, direct_io),
                |files| {
                    let threads = num_cpus::get() as u64;
                    let chunk_size = num_bytes / threads;

                    let from = File::open(files.from).unwrap();
                    let to = File::create(files.to).unwrap();
                    advise(&from);
                    to.set_len(*num_bytes).unwrap();

                    let mut results = Vec::with_capacity(threads as usize);
                    for i in 0..threads {
                        let from = from.try_clone().unwrap();
                        let to = to.try_clone().unwrap();

                        results.push(thread::spawn(move || {
                            let mut buf = Vec::with_capacity(chunk_size as usize);
                            // We write those bytes immediately after and dropping u8s does nothing
                            #[allow(clippy::uninit_vec)]
                            unsafe {
                                buf.set_len(chunk_size as usize);
                            }

                            from.read_exact_at(&mut buf, i * chunk_size).unwrap();
                            to.write_all_at(&buf, i * chunk_size).unwrap();
                        }));
                    }
                    for handle in results {
                        handle.join().unwrap();
                    }

                    files.dir
                },
                BatchSize::PerIteration,
            )
        },
    );

    group.bench_with_input(
        BenchmarkId::new("buffered_entire_file", num_bytes),
        &num_bytes,
        |b, num_bytes| {
            b.iter_batched(
                || NormalTempFile::create(*num_bytes as usize, direct_io),
                |files| {
                    let mut from = File::open(files.from).unwrap();
                    let mut to = File::create(files.to).unwrap();
                    advise(&from);
                    to.set_len(*num_bytes).unwrap();

                    let mut buf = Vec::with_capacity(*num_bytes as usize);
                    from.read_to_end(&mut buf).unwrap();
                    to.write_all(&buf).unwrap();

                    files.dir
                },
                BatchSize::PerIteration,
            )
        },
    );

    group.bench_with_input(
        BenchmarkId::new("mmap_read_only", num_bytes),
        &num_bytes,
        |b, num_bytes| {
            b.iter_batched(
                || NormalTempFile::create(*num_bytes as usize, direct_io),
                |files| {
                    let from = File::open(files.from).unwrap();
                    let reader = unsafe { Mmap::map(&from) }.unwrap();
                    let mut to = File::create(files.to).unwrap();
                    advise(&from);

                    to.write_all(reader.as_ref()).unwrap();

                    files.dir
                },
                BatchSize::PerIteration,
            )
        },
    );

    group.bench_with_input(
        BenchmarkId::new("mmap_read_only_truncate", num_bytes),
        &num_bytes,
        |b, num_bytes| {
            b.iter_batched(
                || NormalTempFile::create(*num_bytes as usize, direct_io),
                |files| {
                    let from = File::open(files.from).unwrap();
                    let reader = unsafe { Mmap::map(&from) }.unwrap();
                    let mut to = File::create(files.to).unwrap();
                    advise(&from);
                    to.set_len(*num_bytes).unwrap();

                    to.write_all(reader.as_ref()).unwrap();

                    files.dir
                },
                BatchSize::PerIteration,
            )
        },
    );

    #[cfg(target_os = "linux")]
    group.bench_with_input(
        BenchmarkId::new("mmap_read_only_fallocate", num_bytes),
        &num_bytes,
        |b, num_bytes| {
            b.iter_batched(
                || NormalTempFile::create(*num_bytes as usize, direct_io),
                |files| {
                    let from = File::open(files.from).unwrap();
                    let reader = unsafe { Mmap::map(&from) }.unwrap();
                    let mut to = File::create(files.to).unwrap();
                    advise(&from);
                    allocate(&to, *num_bytes);

                    to.write_all(reader.as_ref()).unwrap();

                    files.dir
                },
                BatchSize::PerIteration,
            )
        },
    );

    group.bench_with_input(
        BenchmarkId::new("mmap_rw_truncate", num_bytes),
        &num_bytes,
        |b, num_bytes| {
            b.iter_batched(
                || NormalTempFile::create(*num_bytes as usize, direct_io),
                |files| {
                    let from = File::open(files.from).unwrap();
                    let to = OpenOptions::new()
                        .read(true)
                        .write(true)
                        .create(true)
                        .open(files.to)
                        .unwrap();
                    to.set_len(*num_bytes).unwrap();
                    advise(&from);
                    let reader = unsafe { Mmap::map(&from) }.unwrap();
                    let mut writer = unsafe { MmapOptions::new().map_mut(&to) }.unwrap();

                    writer.copy_from_slice(reader.as_ref());

                    files.dir
                },
                BatchSize::PerIteration,
            )
        },
    );
}

fn open_standard(path: &Path, direct_io: bool) -> File {
    let mut options = OpenOptions::new();
    options.write(true).create(true).truncate(true);

    #[cfg(target_os = "linux")]
    if direct_io {
        use nix::libc::O_DIRECT;
        use std::os::unix::fs::OpenOptionsExt;
        options.custom_flags(O_DIRECT);
    }

    let file = options.open(path).unwrap();

    #[cfg(target_os = "macos")]
    if direct_io {
        use nix::{
            errno::Errno,
            libc::{fcntl, F_NOCACHE},
        };
        Errno::result(unsafe { fcntl(file.as_raw_fd(), F_NOCACHE) }).unwrap();
    }

    file
}

fn write_from_buffer(to: PathBuf, mut reader: BufReader<File>) {
    advise(reader.get_ref());
    let mut to = File::create(to).unwrap();
    to.set_len(reader.get_ref().metadata().unwrap().len())
        .unwrap();

    loop {
        let len = {
            let buf = reader.fill_buf().unwrap();
            if buf.is_empty() {
                break;
            }

            to.write_all(buf).unwrap();
            buf.len()
        };
        reader.consume(len)
    }
}

#[cfg(target_os = "linux")]
fn allocate(file: &File, len: u64) {
    use nix::{
        fcntl::{fallocate, FallocateFlags},
        libc::off_t,
    };
    fallocate(file.as_raw_fd(), FallocateFlags::empty(), 0, len as off_t).unwrap();
}

fn advise(_file: &File) {
    // Interestingly enough, this either had no effect on performance or made it slightly worse.
    // posix_fadvise(file.as_raw_fd(), 0, 0, POSIX_FADV_SEQUENTIAL).unwrap();
}

fn create_random_buffer(bytes: usize, direct_io: bool) -> Vec<u8> {
    let mut buf = if direct_io {
        let layout = Layout::from_size_align(bytes, 1 << 12).unwrap();
        let ptr = unsafe { alloc::alloc(layout) };
        unsafe { Vec::<u8>::from_raw_parts(ptr, bytes, bytes) }
    } else {
        let mut v = Vec::with_capacity(bytes);
        // We write those bytes immediately after and dropping u8s does nothing
        #[allow(clippy::uninit_vec)]
        unsafe {
            v.set_len(bytes);
        }
        v
    };
    thread_rng().fill_bytes(buf.as_mut_slice());
    buf
}

criterion_group! {
    name = benches;
    config = Criterion::default().noise_threshold(0.02).warm_up_time(Duration::from_secs(1));
    targets =
    with_memcache,
    initially_uncached,
    empty_files,
    just_writes,
}
criterion_main!(benches);
SUPERCILEX
-1

You may want to benchmark the `dd` command.

ennuikiller